Web Crawler Study Notes (Part 8)

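This installment covers regular expressions. The previous posts covered the requests module, which is much more concise than urllib in many respects and can be seen as a refinement of it. So what is a regular expression? Put simply, it is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string", and that rule string expresses a filtering logic to apply to text. Regular expressions are used very widely and are indispensable in many programming languages. In the currently popular field of artificial intelligence, and in natural language processing in particular, they are a skill practitioners need to have, and fluency with them makes research work much more convenient.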

re.match

re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

re.match(pattern,string,flags=0)
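As a small illustration of the three parameters, the sketch below passes re.I as flags (case-insensitive matching) and checks for None before using the result. The helper name starts_with_hello and the sample strings are made up for this illustration.

```python
import re

def starts_with_hello(text):
    """Return True if text begins with 'hello', ignoring case."""
    # re.match only tries the pattern at position 0 of the string.
    result = re.match(r"hello", text, flags=re.I)
    # A failed match returns None, so test before calling .group().
    return result is not None

print(starts_with_hello("Hello 123"))   # True
print(starts_with_hello("say hello"))   # False
```

Checking for None first avoids the AttributeError that result.group() raises when nothing matched.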

Basic matching

import re
content = "hello 123 4567 world_this is a regex demo"
print(len(content))
result = re.match(r"^hello\s\d{3}\s\d{4}\s\w.*demo$", content)
print(result)
print(result.group())  # the matched text
print(result.span())   # the (start, end) range of the match

This pattern matches the whole of content, and it is the most basic kind of match.

Next, a generic (wildcard) match:

import re
content = "hello 123 4567 world_this is a regex demo"
result = re.match("hello.*demo$",content)
print(result)
print(result.group())
print(result.span())

Next, matching a specific target. To extract part of the match, wrap that part of the pattern in parentheses (). For example, to capture 1234567:

import re
content = "hello 1234567 world_this is a regex demo"
result = re.match(r"hello\s(\d+)\sworld.*demo$", content)
print(result)
print(result.group(1))  # the text captured by the first parenthesized group
print(result.span())

Greedy matching: for a given pattern, match as much of the string as possible:

import re
content = "hello 1234567 world_this is a regex demo"
result = re.match(r"he.*(\d+).*demo$", content)
print(result)
print(result.group(1))

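Only the last digit, 7, is printed. Because ".*" is greedy, the first ".*" swallows "123456" together with the rest of "hello", since that lets it match as many characters as possible. (\d+) needs at least one digit, so the greedy ".*" leaves it exactly one, and the result is just 7. The trailing ".*" is greedy in the same way and takes everything up to "demo".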
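To make the greedy behaviour visible, here is a small sketch that wraps the first ".*" in its own group so that what it consumed can be printed. The extra group and the variable names are additions for illustration only; the pattern is otherwise the one from the greedy example.

```python
import re

content = "hello 1234567 world_this is a regex demo"
# Same pattern as the greedy example, with the first ".*" wrapped in a
# group so that the text it consumed becomes visible.
result = re.match(r"he(.*)(\d+).*demo$", content)
greedy_part = result.group(1)
digits = result.group(2)
print(repr(greedy_part))  # 'llo 123456'
print(digits)             # 7
```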

Non-greedy matching, naturally, means matching as little of the string as possible for a given pattern:
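The code for this step does not appear in the post as it stands. A minimal sketch, assuming the only change from the greedy example is ".*?" in place of the first ".*":

```python
import re

content = "hello 1234567 world_this is a regex demo"
# ".*?" is the non-greedy form of ".*": it matches as few characters
# as possible before the rest of the pattern is tried.
result = re.match(r"he.*?(\d+).*demo$", content)
captured = result.group(1)
print(result)
print(captured)  # 1234567
```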

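Because ".*?" is non-greedy and is followed by a digit pattern, it matches as few non-digit characters as possible, whereas a plain ".*" is greedy. So ".*?" stops just before the first digit, 1, and group(1), which is the content of the first parenthesized group (the digits), is 1234567.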

Escaping: to match a special character literally, such as $ or ?, put a "\" in front of it.

import re
content = "price is $10.00"
result = re.match(r"price is \$10\.00", content)
print(result)
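When the text to match is a literal string, the standard library's re.escape can add the backslashes automatically. The sketch below reuses the price string from the example above; using re.escape here is a suggestion, not something the original post does.

```python
import re

content = "price is $10.00"
# re.escape backslash-escapes the regex metacharacters in a literal string.
pattern = re.escape("price is $10.00")
result = re.match(pattern, content)
print(result.group())

# Without escaping, "$" is an end-of-string anchor, so nothing matches.
unescaped = re.match("price is $10.00", content)
print(unescaped)  # None
```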

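In summary: prefer generic patterns, use parentheses to capture the target, and prefer non-greedy mode. Also note that re.match depends entirely on the start of the string, which makes it inconvenient in practice. For example: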

import re
content = "machine learning is very interesting"
# try to match from the middle of the string
result = re.match("learning.*",content)
print(result)

Because the start of the string does not match, the result is None.

So next, let's look at re.search.

re.search

re.search scans the whole string and returns the first successful match.

import re
content = "machine learning is very interesting"
# try to match from the middle of the string
result = re.search("learning.*",content)
print(result)

As you can see, the match succeeds and the result is returned.

So in general it is better to use re.search.
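A short sketch of how the two functions relate, using the same sample sentence: span() shows where re.search found the match, and anchoring the pattern with "^" makes re.search behave like re.match for this input. The variable names are illustrative.

```python
import re

content = "machine learning is very interesting"
# re.search finds a match anywhere; span() shows where it starts and ends.
found = re.search(r"learning", content)
print(found.span())   # (8, 16)

# Anchoring the pattern with "^" restricts the match to the start of
# the string, so this search fails just as re.match did.
anchored = re.search(r"^learning", content)
print(anchored)       # None
```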

re.sub

re.sub replaces every matching substring in a string and returns the resulting string.

import re
content = "computer science is so interesting 太666了。"
new_content = re.sub(r"\d+", "good", content)
print(new_content)

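But what if the replacement needs to include the matched text itself, that is, the new text adds extra characters to the original? For example, to replace "666" with "666 nice", do the following: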

import re
content = "computer science is so interesting 太666了。"
new_content = re.sub(r"(\d+)", r"\1 nice", content)
print(new_content)

Here '\1' refers to the content of the first parenthesized group, which is inserted into the replacement.
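Two related details of re.sub, sketched under the assumption that they are useful here (the post itself does not use them): the explicit group reference \g<1>, which is needed when the reference is followed directly by a digit, and the count parameter, which limits how many matches are replaced. The sample string is a simplified version of the one above.

```python
import re

content = "computer science is so interesting 666"
# r"\g<1>" is the explicit spelling of r"\1". Writing r"\10" would be
# read as a reference to group 10, so r"\g<1>0" is used to append "0".
new_content = re.sub(r"(\d+)", r"\g<1>0", content)
print(new_content)   # computer science is so interesting 6660

# count limits the number of replacements (0, the default, means all).
first_only = re.sub(r"\d", "x", content, count=1)
print(first_only)    # computer science is so interesting x66
```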

re.compile

re.compile compiles a regex string into a pattern object so that the pattern can be reused.

import re
content = """hello 1234567 world_this
is a regex demo"""
# re.S makes "." match every character in the string, including "\n".
pattern = re.compile("hello.*demo", re.S)
result = re.match(pattern,content)
print(result)
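To show the reuse that compiling is meant for, here is a sketch that compiles a pattern once and then calls the pattern object's own search and findall methods on several strings. The pattern and the sample lines are made up for illustration.

```python
import re

# Compile once, then call the pattern object's own methods.
number_pattern = re.compile(r"\d+")
lines = ["order 12", "no digits here", "items 3 and 45"]  # made-up sample data

for line in lines:
    found = number_pattern.search(line)
    # search returns None when the line contains no digits.
    print(line, "->", found.group() if found else None)

# findall returns every non-overlapping match as a list of strings.
all_numbers = [number_pattern.findall(line) for line in lines]
print(all_numbers)
```

The pattern object exposes match, search, sub and findall with the same meaning as the module-level functions, minus the pattern argument.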

In short, regular expressions are a deep subject, and mastering them takes repeated study and practice.
