Python Web Scraping: Extracting Novel Content with Regular Expressions

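import json
import re

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch a page and return its HTML text, or None on failure."""
    # try-except block:
    # if the code in the try block runs without problems, Python skips the except block;
    # if the code in the try block raises an error, Python looks for an except block
    # whose specified error matches the one raised and runs the code in it.
    try:
        # Request headers sent to the server (the original values are missing from the source).
        headers = {}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # Without this statement the scraped text comes out garbled:
            # the response is decoded as ISO-8859-1 and has to be switched
            # to UTF-8 by setting response.encoding = 'utf8'.
            response.encoding = 'utf8'
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield the groups captured from the page."""
    # The HTML tags inside this pattern were stripped from the source; only the
    # capture groups and the "acticlebody" marker survive, so it is kept as found.
    pattern = re.compile(
        '(.*?).*?(.*?)'
        '.*?"acticlebody">.*?(.*?)',
        re.S)
    # findall() returns a list. If the pattern contains (), only the contents of
    # the groups are returned; without (), the whole matched text is returned.
    items = re.findall(pattern, html)
    # items = re.search(pattern, html)  # search() matches the pattern as a whole (first match only)
    for item in items:
        # The yielded expression is missing from the source; the tuple of
        # captured groups is yielded as-is here.
        yield item
        # print(item)


def write_to_file(content):
    """Write to file."""
    # Mode 'a' appends to the end of the file instead of overwriting it;
    # 'w' would overwrite the existing contents.
    with open('lingwufengshen.txt', 'a', encoding='utf-8') as f:
        print(type(json.dumps(content)))
        # json.loads() converts a string to a dict; json.dumps() converts a dict to a string.
        # ensure_ascii=False keeps the output as readable Chinese instead of \u escapes.
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main():
    url = ''  # the target URL is missing from the source
    html = get_one_page(url)
    # print(html)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        write_to_file(item)


# __name__ is a built-in variable that holds the name of the current module.
# Code under if __name__ == '__main__' runs only when this file is executed
# directly, not when it is imported by another program.
if __name__ == '__main__':
    main()

# To see the source returned by the original request, open the Response tab
# in the Network panel of the browser's developer tools.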
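The comments on findall() and search() in the script can be checked with a small experiment. The sketch below uses made-up markup (the h1 and div tags and the "body" class are assumptions for illustration, not the structure of the scraped site) to show what each call returns.

```python
import re

# Hypothetical sample markup; tag and class names are invented for this sketch.
html = ('<h1>Chapter 1</h1><div class="body">First line.</div>'
        '<h1>Chapter 2</h1><div class="body">Second line.</div>')

pattern = re.compile(r'<h1>(.*?)</h1>.*?<div class="body">(.*?)</div>', re.S)

# With capture groups, findall() returns a list of tuples, one tuple per match.
with_groups = re.findall(pattern, html)

# Without capture groups, findall() returns the full matched text of each match.
without_groups = re.findall(r'<h1>.*?</h1>', html)

# search() returns a match object for the first match only (or None if nothing matches).
first = re.search(pattern, html)

print(with_groups)
print(without_groups)
print(first.groups())
```

Because findall() already returns every match, the script can loop over its result directly, while search() would only ever give the first chapter.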

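The effect of ensure_ascii mentioned in write_to_file can be seen in isolation. This is a minimal sketch with a hypothetical record; the key names and values are assumptions, since the fields the original script yielded are not recoverable.

```python
import json

# Hypothetical record; the keys are invented for this sketch.
record = {'title': '第一章', 'index': 1}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the Chinese characters readable in the output.
readable = json.dumps(record, ensure_ascii=False)

# json.loads() turns the string back into a dict.
restored = json.loads(readable)

print(escaped)
print(readable)
print(type(readable), restored == record)
```

Both strings load back to the same dict; the option only changes how the text looks in the file, which is why the file is also opened with encoding='utf-8'.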
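Python Web Scraping: Regular Expressions

Regular expressions are an efficient and elegant tool for matching strings and are well worth mastering. With them you can easily extract the content you want from a returned page. 1. Greedy and non-greedy modes: Python's quantifiers are greedy by default. Greedy mode always tries to match as many characters as possible; non-greedy mode always tries to match as few as possible. Non-greedy mode is generally used for extraction. 2. The backslash problem: in regular expressions...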
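The difference between the two modes described above is easiest to see on a string with two tagged spans. The sample text below is an assumption made for this sketch.

```python
import re

# Hypothetical sample text with two <b> spans.
text = '<b>first</b> and <b>second</b>'

# Greedy: .* grabs as much as possible, so the match runs to the LAST </b>.
greedy = re.findall(r'<b>(.*)</b>', text)

# Non-greedy: .*? stops at the first </b> it can, giving one result per span.
lazy = re.findall(r'<b>(.*?)</b>', text)

print(greedy)  # ['first</b> and <b>second']
print(lazy)    # ['first', 'second']
```

This is why scraping patterns, including the one in the script above, are usually written with (.*?) rather than (.*).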

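Python Web Scraping: Regular Expressions

Most regular expressions can be produced directly with a regex generator tool. Common matching characters; re.match and basic matching: re.match tries to match a pattern at the start of the string, and if the pattern does not match at the start, match returns None. re.match(pattern, string, flags=0) returns a match object, in which span indicates the...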
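A short sketch of the re.match behaviour described above, using made-up input strings. It also shows re.search for contrast, since search scans the whole string instead of only the start.

```python
import re

# The pattern matches at the start of the string, so a match object is returned.
m = re.match(r'\d+', '123abc')
print(m.group())  # the matched text
print(m.span())   # (start, end) positions of the match

# The digits are not at the start, so re.match returns None.
no_match = re.match(r'\d+', 'abc123')
print(no_match)

# re.search looks anywhere in the string and finds the digits.
found = re.search(r'\d+', 'abc123')
print(found.span())
```

Since re.match can return None, checking the result before calling group() or span() avoids an AttributeError.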

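Python Web Scraping: Regular Expressions

Common regex characters and their meanings: . matches any character except a newline; ^ matches the start of the string; $ matches the end of the string; (...) matches the expression inside the parentheses and also denotes a group; \s matches whitespace characters; \S matches any non-whitespace character; \d matches digits, equivalent to [0-9]; \D matches any non-digit, equivalent to [^0-9]; \w matches alphanumeric characters, equivalent to [a-zA-Z0-9_] for ASCII text; \W matches non-alphanumeric characters, equivalent to [^a-z...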
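The character classes listed above can be tried out on a short string. The sample below is an assumption made for this sketch.

```python
import re

# Hypothetical sample string.
sample = 'Tel: 555 0199'

digits = re.findall(r'\d+', sample)      # runs of digits
spaces = re.findall(r'\s', sample)       # each whitespace character
words = re.findall(r'\w+', sample)       # runs of letters, digits and underscore
non_words = re.findall(r'\W', sample)    # everything \w does not match
starts = re.search(r'^Tel', sample)      # ^ anchors at the start of the string
ends = re.search(r'0199$', sample)       # $ anchors at the end of the string
grouped = re.search(r'(\d+) (\d+)', sample)  # () captures each number as a group

print(digits, spaces, words, non_words)
print(starts is not None, ends is not None, grouped.groups())
```

Raw strings (the r prefix) are used so that backslashes such as \d reach the regex engine unchanged.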
