A simple single-page scrape of SpanishDict

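The site's layout is also pleasant to read, and the reason I first picked up web scraping was that I wanted to pull down the translations of some of its vocabulary. Once I got into it, I found that scraping really is a deep pit. After filling in that pit all this time, looking back at these first few lines of code, there is really not much technical substance in them.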

I am posting it here as a marker in time.

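import json

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one page and return its HTML, or None on any failure."""
    try:
        headers = {}  # the header values did not survive in the original post
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Print and collect the text of every translation block on the page."""
    doc = BeautifulSoup(html, 'lxml')
    items = doc.find_all('div', class_='dictionary-neodict-indent-1')
    content = []  # initial value missing in the original; a list is assumed
    for item in items:
        print(item.text)
        content.append(item.text)  # assumed, so the returned list is not empty
    return content


def write_to_file(content):
    """Append one word's results to the output file as a single JSON line."""
    with open('spanishdict.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(voca):
    url = '' + str(voca)  # the base URL is missing from the original post
    html = get_one_page(url)
    if html is None:  # request failed, nothing to parse
        return
    content = parse_one_page(html)
    write_to_file(content)


if __name__ == '__main__':
    for voca in ['ir', 'venir']:
        print(voca)
        main(voca)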
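Because write_to_file appends one json.dumps(...) result plus a newline per word, spanishdict.txt ends up with one JSON document per line. Below is a minimal sketch of that write step and of how such a file could be read back. It is not part of the original script: the helper names append_record and read_records, the file name spanishdict_demo.txt, and the sample records are all hypothetical and used only for illustration.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line, keeping accents readable."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_records(path):
    """Read the file back: one JSON document per non-empty line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample data, not real scraped output.
sample = [['ir', ['to go']], ['venir', ['to come']], ['señor', ['sir']]]

# Write to a throwaway directory so the demo leaves the real output file alone.
path = os.path.join(tempfile.mkdtemp(), 'spanishdict_demo.txt')
for record in sample:
    append_record(path, record)

records = read_records(path)
print(records)
```

With ensure_ascii=False, a word such as señor is stored as written rather than as a \u00f1 escape, which keeps the file readable in a text editor. Opening the file with encoding='utf-8' on both the write and the read side is what makes that safe.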
