Python Regular Expressions and the BeautifulSoup Module

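Pattern               Meaning
abc                   matches the literal string abc
[...]                 matches any one character that appears inside the brackets
[0-9]                 matches any one digit from 0 to 9
(abc|李四|小紅)         matches abc, 李四, or 小紅
(abc|李四|小紅){2,3}    matches abc, 李四, or 小紅 repeated 2 or 3 times
^abc                  the string must begin with abc
abc$                  the string must end with abc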
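The patterns above can be tried out directly with the standard re module. This is only a small sketch: the sample strings are made up for illustration, and the variable names are not from the original post.

```python
import re

# Character class: [0-9] matches one digit at a time.
digits = re.findall(r'[0-9]', 'a1b22')

# Alternation: the group matches whichever alternative is present.
names = re.findall(r'(abc|李四|小紅)', 'abc 和 李四')

# Repeating the group: {2,3} requires 2 or 3 alternatives in a row.
two_in_a_row = re.fullmatch(r'(abc|李四|小紅){2,3}', 'abc李四') is not None
only_one = re.fullmatch(r'(abc|李四|小紅){2,3}', 'abc') is not None

# Anchors: ^ pins the match to the start of the string, $ to the end.
starts_with_abc = re.search(r'^abc', 'abcdef') is not None
ends_with_abc = re.search(r'abc$', 'xyzabc') is not None
abc_not_at_start = re.search(r'^abc', 'xabc') is not None

print(digits, names)                  # ['1', '2', '2'] ['abc', '李四']
print(two_in_a_row, only_one)         # True False
print(starts_with_abc, ends_with_abc, abc_not_at_start)  # True True False
```

Note that ^abc and abc$ test for the whole sequence abc at the start or end, not just for a single letter.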

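Metacharacter         Meaning
.                     any one character (except a newline)
\d                    one digit
\s                    one whitespace character (space, tab, newline and so on)
\b                    a word boundary (the edge between a word character and a non-word character, or the start or end of the string)
\w                    one letter, digit, or underscore; in Python 3 it also matches a Chinese character
\D                    one character that is not a digit; an uppercase form (\D, \S, \W, \B) means the opposite of its lowercase form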
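A quick way to get a feel for the metacharacters is to run each one through re.findall. The sample strings below are invented for this sketch.

```python
import re

text = "ab_1 c2"

digits = re.findall(r'\d', text)        # digits only
non_digits = re.findall(r'\D', text)    # everything that is not a digit
word_chars = re.findall(r'\w', text)    # letters, digits, underscore
spaces = re.findall(r'\s', text)        # whitespace

# In Python 3, str patterns are Unicode-aware, so \w also matches CJK characters.
cjk = re.findall(r'\w', '中文 ok')

# \b only matches at the edge of a word, so "concat" is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')

# . matches any character except a newline.
dot = re.findall(r'.', 'a\nb')

print(digits, non_digits, word_chars, spaces)
print(cjk, whole_words, dot)
```

The \b check depends on word characters rather than on spaces alone: the final "cat." still counts because the full stop is not a word character.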

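Quantifier            Meaning
a{3}                  exactly three a's: aaa
d{2,4}                dd, ddd, or dddd
{n,m}                 the preceding character is matched between n and m times
+                     one or more times
*                     any number of times: zero or more
?                     zero or one time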
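Each quantifier can be checked in a line or two. The following sketch uses made-up sample strings; re.fullmatch is used where the whole string has to fit the pattern.

```python
import re

# {3}: exactly three repetitions.
exactly_three = re.fullmatch(r'a{3}', 'aaa') is not None

# {2,4}: between two and four repetitions.
candidates = ('d', 'dd', 'ddd', 'dddd', 'ddddd')
two_to_four = [s for s in candidates if re.fullmatch(r'd{2,4}', s)]

# +: one or more.
one_or_more = re.findall(r'\d+', 'a1b22c')

# *: zero or more, so a bare "a" still matches.
zero_or_more = re.findall(r'ab*', 'a ab abbb')

# ?: zero or one, which makes the "u" optional.
optional = re.findall(r'colou?r', 'color colour')

print(exactly_three, two_to_four)
print(one_or_more, zero_or_more, optional)
```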

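import re  # import the module

text = "a123b456b789b"
reg = re.compile(r'[a-zA-Z]\d')  # compile the pattern: one letter followed by one digit
# print(re.search(reg, text))  # checks whether a match exists; returns None if there is none
result = re.findall(reg, text)  # collect every match and return them as a list
print(result)

# greedy mode
reg2 = re.compile(r'a.+b')
print(re.findall(reg2, text))  # this matches the entire string

# non-greedy mode
reg3 = re.compile(r'a.+?b')
print(re.findall(reg3, text))  # this matches only "a123b"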

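In greedy mode, which is the default, the re module matches as much of the string as it can. Adding ? after a quantifier switches it to non-greedy mode, which matches as little as possible.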
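The difference matters most when pulling delimited fields out of a longer string. Here is a small sketch with an invented sample string, extracting the text between double quotes:

```python
import re

line = 'say "a" and "b"'

# Greedy: .+ runs to the last closing quote, swallowing everything in between.
greedy = re.findall(r'"(.+)"', line)

# Non-greedy: .+? stops at the first closing quote it can.
lazy = re.findall(r'"(.+?)"', line)

print(greedy)  # ['a" and "b']
print(lazy)    # ['a', 'b']
```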

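import random  # assumed: used to pick one User-Agent at random
import requests  # 1 - import the modules
import chardet  # detects the encoding of the page source

url = ""  # 2 - the target URL
user_agent = [
    # browser User-Agent strings
]
headers = {"User-Agent": random.choice(user_agent)} if user_agent else {}
res = requests.get(url, headers=headers)  # 3 - request the URL, sending the browser header information
encode = chardet.detect(res.content)  # detect the page encoding; the result is a dict
res.encoding = encode.get('encoding')  # set the encoding
print(res.text)  # 4 - print the page as plain text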

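Common parameters of the get() method:

url: the link of the web page to request

headers: the browser header information sent with the request; it must be a dict and is optional

proxies: sets a proxy IP; it must be a dict and is optional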
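The three parameters can be put together roughly as follows. This is a sketch under stated assumptions: the User-Agent string, the proxy address, and the helper names build_get_kwargs and fetch are all hypothetical and do not come from the original post.

```python
import random

# Hypothetical values for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]
PROXIES = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}


def build_get_kwargs(user_agents, proxies=None):
    """Assemble the optional keyword arguments for requests.get()."""
    kwargs = {"headers": {"User-Agent": random.choice(user_agents)}}
    if proxies:
        kwargs["proxies"] = proxies
    return kwargs


def fetch(url, user_agents, proxies=None, timeout=10):
    """Send the GET request; needs the third-party requests package."""
    import requests  # imported here so build_get_kwargs() works without it
    return requests.get(url, timeout=timeout, **build_get_kwargs(user_agents, proxies))
```

Both headers and proxies are plain dicts, so they can be prepared and inspected before any request is sent. A call would look like fetch(page_url, USER_AGENTS), with page_url supplied by you.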

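from bs4 import BeautifulSoup  # import the module

# The tags of this sample were lost; the markup below is an assumed reconstruction
# modeled on the "Dormouse" sample from the Beautiful Soup documentation.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
# convert to a bs4 object
soup = BeautifulSoup(html, 'html.parser')
# soup.<name> finds the first <name> tag
# print(soup.title)
# print(soup.find("title"))
# text content of the tag
# print(soup.find("title").text)
# look up an object by id
# ul = soup.find(id="ulone")
# print(ul)
# look up an object by class; note the parameter is class_
# print(soup.find(class_='title'))
# narrow the search
# print(soup.find('p', class_='title'))
# general form; the dict may hold several conditions
# print(soup.find('p', attrs={...}))
a = soup.find(class_='story').find('a')
# show all attributes of a and their values; this is a dict
print(a.attrs)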

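find() returns the first tag that matches, and only that one; if nothing matches, it returns None.

To find several, use find_all(): it looks up every match and returns a list.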
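The contrast between the two methods is easy to see on a small document. The markup below is hypothetical and written only for this sketch; it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not taken from the original post.
html = """
<ul id="links">
  <li><a class="sister" href="/elsie">Elsie</a></li>
  <li><a class="sister" href="/lacie">Lacie</a></li>
  <li><a class="sister" href="/tillie">Tillie</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", class_="sister")      # a single Tag
every = soup.find_all("a", class_="sister")  # a list of Tags
missing = soup.find("table")                 # None, because nothing matches
hrefs = [a["href"] for a in every]           # read one attribute from each Tag

print(first.text)  # Elsie
print(hrefs)       # ['/elsie', '/lacie', '/tillie']
```

Because find() can return None, chained calls such as soup.find(...).find(...) are worth guarding when the page layout is not under your control.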

Using requests + bs4 to scrape wallpaper links from 彼岸桌布 and save them to a CSV file

import requests  # 1 - prepare the tools
import chardet
import random
from bs4 import BeautifulSoup
import csv


def get_html(url):
    user_agent = [
        # browser User-Agent strings
    ]
    # assumed: send one User-Agent chosen at random
    headers = {"User-Agent": random.choice(user_agent)} if user_agent else {}
    res = requests.get(url, headers=headers)  # 3 - request the URL with the browser headers
    encode = chardet.detect(res.content)  # detect the page encoding
    res.encoding = encode.get('encoding')  # set the encoding
    return res


if __name__ == "__main__":
    url = ""  # the wallpaper listing page
    html = get_html(url).text
    soup = BeautifulSoup(html, "html.parser")
    img_list = soup.find('div', class_='list').find_all('img')
    links = []
    for i in img_list:
        links.append([i.attrs['src']])  # one single-column row per link
    # write the links to a CSV file
    with open("桌布鏈結.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(links)
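One detail of the CSV step is easy to get wrong: writerows() expects a sequence of rows, and each row is itself a sequence of fields. A sketch with hypothetical links, writing to an in-memory buffer instead of a file:

```python
import csv
import io

links = ["/img/a.jpg", "/img/b.jpg"]  # hypothetical links for illustration

# Correct: wrap each link in its own one-field row.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows([link] for link in links)
rows = buffer.getvalue().splitlines()

# Pitfall: a bare string is treated as a row of single characters.
wrong = io.StringIO(newline="")
csv.writer(wrong).writerows(["ab"])
split_chars = wrong.getvalue().splitlines()

print(rows)         # ['/img/a.jpg', '/img/b.jpg']
print(split_chars)  # ['a,b']
```

This is why the script appends [i.attrs['src']] rather than the bare string.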
