1: Features and usage of bs4 — success
from bs4 import BeautifulSoup  # the class is BeautifulSoup, not beautifulsoup
import requests

r = requests.get('')  # target URL elided in the original
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())  # print the parsed HTML page with indentation

tag = soup.a                 # first <a> tag in the document
print(tag.attrs)             # the tag's attributes, as a dict
print(tag.attrs['class'])
print(type(tag.attrs))
print(soup.a.prettify())

newsoup = BeautifulSoup('我明白了bs4的使用', 'html.parser')
print(newsoup.prettify())
print(soup.contents)
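Since the URL in the snippet above was elided, here is a self-contained sketch of the same bs4 calls run against an inline HTML string; the markup and the link in it are invented for illustration:

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, since the original URL was elided
demo = '<html><body><p><a class="py1" href="http://example.com/a">Link A</a></p></body></html>'
soup = BeautifulSoup(demo, 'html.parser')

tag = soup.a                  # first <a> tag
print(tag.attrs)              # {'class': ['py1'], 'href': 'http://example.com/a'}
print(tag.attrs['class'])     # ['py1'] — note bs4 always returns class as a list
print(type(tag.attrs))        # <class 'dict'>
print(soup.prettify())        # the document re-indented, one tag per line
```

One detail worth remembering: `class` is a multi-valued attribute in HTML, so bs4 gives it back as a list even when there is only one class name.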
2: Scraping ** with re — failed
import requests
import re

def gethtmltext(url):
    try:
        kv = {}  # request headers; the cookie value was elided in the original
        r = requests.get(url, timeout=30, headers=kv)
        r.raise_for_status()
        return r.text
    except:
        return "爬取失敗"  # "crawl failed"

def parsepage(glist, html):
    try:
        price_list = re.findall(r'', html)  # price regex elided in the original
        name_list = re.findall(r'', html)   # title regex elided in the original
        for i in range(len(price_list)):
            price = eval(price_list[i].split(":")[1])
            name = eval(name_list[i].split(":")[1])
            glist.append([price, name])
    except:
        print("解析失敗")  # "parsing failed"

def printgoodlist(glist):
    tplt = "\t\t"  # the "{}" format placeholders appear to have been stripped during extraction
    print(tplt.format("序號", "商品**", "商品名稱"))
    count = 0
    for g in glist:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

goods_name = "書包"
start_url = "" + goods_name  # search URL elided in the original
info_list = []
page = 3
count = 0
for i in range(page):
    count += 1
    try:
        url = start_url + "&s=" + str(44 * i)
        html = gethtmltext(url)
        parsepage(info_list, html)
        # the format placeholder was stripped here too; restored as "{:.2f}" so the
        # progress percentage actually renders
        print("\r爬取頁面當前進度: {:.2f}%".format(count * 100 / page), end="")
    except:
        continue

printgoodlist(info_list)
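The two regex patterns above were lost, so here is a self-contained sketch of how the extraction in `parsepage` works on a mock result string. The field names (`view_price`, `raw_title`), the patterns, and the sample data are all assumptions for illustration, not the original's:

```python
import re

# Mock search-result text in the '"key":"value"' style the parser expects (assumed format)
html = '"view_price":"59.00","raw_title":"學生書包","view_price":"89.90","raw_title":"雙肩書包"'

# Patterns of the shape commonly used for this kind of page (an assumption here)
price_list = re.findall(r'"view_price":"[\d.]*"', html)
name_list = re.findall(r'"raw_title":".*?"', html)

goods = []
for p, n in zip(price_list, name_list):
    # split on ":" and eval the quoted value to strip its quotes, e.g. '"59.00"' -> '59.00'
    price = eval(p.split(':')[1])
    name = eval(n.split(':')[1])
    goods.append([price, name])

print(goods)  # [['59.00', '學生書包'], ['89.90', '雙肩書包']]
```

`eval` works here only because the matched value is a quoted string literal; `p.split(':')[1].strip('"')` would do the same job without executing anything.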
Reason for failure: the page's cookie was not obtained correctly (but it is unclear ** the error occurred when filling in my own cookie).
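For the cookie problem noted above, one way to debug is to build the request without sending it and inspect exactly which headers would go out. The header values here are placeholders, not real credentials:

```python
import requests

# Placeholder values — substitute the User-Agent and Cookie string copied from your own browser
kv = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'key1=value1; key2=value2',
}

# A PreparedRequest lets us inspect the outgoing headers with no network traffic at all
pr = requests.Request('GET', 'http://example.com/search', headers=kv).prepare()
print(pr.headers['Cookie'])   # key1=value1; key2=value2
print(pr.url)
```

If the printed `Cookie` header does not match what the browser's developer tools show for a logged-in session, the request will look anonymous to the server, which is consistent with the failure described above.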
3: Scraping the 丁香園 forum with lxml — failed
from lxml import etree
import requests

url = ''  # forum URL elided in the original
req = requests.get(url)
html = req.text
tree = etree.HTML(html)  # the constructor is etree.HTML, not etree.html
print(tree)

user = tree.xpath('')     # XPath for user names, elided in the original
content = tree.xpath('')  # XPath for post bodies, elided in the original

results = []
for i in range(0, len(user)):
    # the start of this line was lost in the original; reconstructed as an append
    results.append(user[i].strip() + ": " + content[i].xpath('string(.)').strip())

for i, result in zip(range(0, len(user)), results):
    print("user" + str(i + 1) + "-" + result)
print("*" * 100)
Reason for failure: is my use of this xpath method wrong? Just noting it down for now, haha.
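To sanity-check the xpath usage, here is an offline sketch against an inline HTML string. The tag structure and class names are invented, not the forum's actual markup:

```python
from lxml import etree

# Invented markup standing in for one forum post
html = '''
<div class="post">
  <div class="auth">alice</div>
  <div class="postbody"> Hello <b>world</b> </div>
</div>
'''
tree = etree.HTML(html)

# text() yields plain strings; selecting the element itself lets string(.) flatten
# its full text content, including text inside child tags like <b>
user = tree.xpath('//div[@class="auth"]/text()')
content = tree.xpath('//div[@class="postbody"]')

results = []
for i in range(len(user)):
    results.append(user[i].strip() + ": " + content[i].xpath('string(.)').strip())

print(results)  # ['alice: Hello world']
```

The key distinction: `.strip()` only works on the `user[i]` items because the `/text()` step already produced strings, while `content[i]` is an element, so it needs `xpath('string(.)')` first. Mixing these two up is a common cause of the kind of failure recorded above.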