Using bs4 in Python

bs4 is a third-party library that helps us parse a document quickly and extract the tags and content we want.

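# BeautifulSoup parses HTML and other text formats
# Import bs4 and re
from bs4 import BeautifulSoup
import re

# "html.parser" names the parser to use, here the one for HTML documents;
# BeautifulSoup can parse some other document formats as well, not only HTML.
# html holds the HTML source to parse.
bs = BeautifulSoup(html, "html.parser")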
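For readers who want something to run, here is a minimal sketch of building the `bs` object. The inline `html` string is a made-up sample and not the page used in this post; any HTML text would do.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document (an assumption for illustration only).
html = """
<html>
<head><title>Demo page</title></head>
<body>
<div id="u1"><a class="nav" href="http://example.com/news" name="news">News</a></div>
</body>
</html>
"""

# "html.parser" is the parser bundled with the Python standard library.
bs = BeautifulSoup(html, "html.parser")

print(type(bs).__name__)
print(bs.title.string)
```

The standard-library parser needs no extra install, which is why it is used in this sketch.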

Getting tags, their content, and their attributes

print(bs.title)  # the page's <title>...</title> tag
print(bs.a)  # the first <a>...</a> tag on the page
print(bs.head)  # the page's <head>...</head> tag
# Get the content inside a tag
print(bs.title.string)
print(bs.a.string)
# Get all attributes of a tag
print(bs.a.attrs)
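The sketch below shows what these accessors return, using a made-up HTML snippet (an assumption, not the page from this post). It also shows two details that are easy to miss: `class` comes back as a list, and `.string` is `None` when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = (
    '<html><head><title>Demo page</title></head><body>'
    '<a class="nav" href="http://example.com/news" name="news">News</a>'
    '<a href="http://example.com/map">Map</a>'
    '</body></html>'
)
bs = BeautifulSoup(html, "html.parser")

first_link = bs.a                 # attribute access returns only the first <a>
title_text = bs.title.string      # text of a tag with a single string child
attrs = first_link.attrs          # dict of attributes; class is multi-valued

href = first_link["href"]         # dictionary-style access to one attribute
missing = first_link.get("id")    # .get returns None instead of raising KeyError
body_string = bs.body.string      # None: <body> has two children here

print(title_text, attrs, href, missing, body_string)
```

When a tag may contain several children, `get_text()` is usually the safer way to read its text.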

Traversing the document

# .contents is the list of a tag's direct children
print(bs.head.contents)
for item in bs.head.contents:
    print(item.string)
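Note that `.contents` also includes the whitespace strings between tags, so a loop like the one above prints blank lines as well. A small sketch on a made-up `<head>` (an assumption for illustration) that keeps only the child tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup with newlines between the child tags.
html = '<html><head>\n<title>Demo page</title>\n<meta charset="utf-8">\n</head><body></body></html>'
bs = BeautifulSoup(html, "html.parser")

# Every direct child, including the "\n" strings between the tags.
for child in bs.head.contents:
    print(repr(child))

# .children iterates over the same direct children; keep only Tag objects.
child_tags = [child for child in bs.head.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in child_tags]
print(tag_names)
```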
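# 1. Searching the document: find_all
# String filter: find every <a> tag in the document. The name must match
# exactly, so <a>...</a> is found but a tag such as <adc>...</adc> is not.
taglist = bs.find_all("a")
print(taglist)
# Return only the first three <a> tags with limit=3
taglist = bs.find_all("a", limit=3)
print(taglist)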
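A short sketch of the string filter on a made-up snippet (an assumption, not the page from this post), showing the exact-name match, `limit`, and `find`, which returns the first match only:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: four <a> tags and one <abbr> tag.
html = (
    '<body><a href="/1">one</a><a href="/2">two</a>'
    '<a href="/3">three</a><a href="/4">four</a>'
    '<abbr title="x">abbr</abbr></body>'
)
bs = BeautifulSoup(html, "html.parser")

all_links = bs.find_all("a")             # <abbr> is not matched by "a"
first_three = bs.find_all("a", limit=3)  # stop after three results
first = bs.find("a")                     # first match, or None if there is none

print(len(all_links), len(first_three), first["href"])
```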

Regular expression search

# re is used to write regular expression rules
import re
# Find every tag whose name contains "a";
# re.compile() takes the regular expression pattern
c_list = bs.find_all(re.compile("a"))
print(c_list)
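Beautiful Soup applies the pattern to the tag name with `search()`, so `"a"` matches any name that contains the letter. The sketch below, on a made-up document (an assumption), contrasts that with an anchored pattern:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = (
    "<html><head><title>t</title></head>"
    "<body><a href='/x'>x</a><span>s</span><p>p</p></body></html>"
)
bs = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name containing "a".
loose = [tag.name for tag in bs.find_all(re.compile("a"))]

# Anchored: only the tag named exactly "a".
anchored = [tag.name for tag in bs.find_all(re.compile("^a$"))]

print(loose)
print(anchored)
```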

Function search: a function can be passed in

# has_attr returns True if the tag has the given attribute, otherwise False.
def name_is_has(tag):
    return tag.has_attr("name")

# Find every tag that has a name attribute
h_list = bs.find_all(name_is_has)
print(h_list)
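The filter function receives each tag and returns a boolean, so conditions can be combined freely. A sketch on made-up markup; the function names `has_name_attr` and `has_class_but_no_id` are illustrative choices, not names from this post:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = (
    '<body><input name="q">'
    '<a name="news" class="nav" href="/n">News</a>'
    '<div id="u1" class="box">x</div></body>'
)
bs = BeautifulSoup(html, "html.parser")

def has_name_attr(tag):
    # True for tags that carry a name attribute
    return tag.has_attr("name")

def has_class_but_no_id(tag):
    # True for tags that have a class attribute and no id attribute
    return tag.has_attr("class") and not tag.has_attr("id")

named = [tag.name for tag in bs.find_all(has_name_attr)]
classed = [tag.name for tag in bs.find_all(has_class_but_no_id)]

print(named, classed)
```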

Keyword argument search

# Find the tags with id="head"
i_list = bs.find_all(id="head")
# class is a Python keyword, so class_ is used instead;
# class_=True finds every tag that has a class attribute
i_list = bs.find_all(class_=True)
# Find every tag with the attribute href=""
i_list = bs.find_all(href="")
for item in i_list:
    print(item)
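The sketch below runs the same kinds of keyword filters on a made-up snippet (an assumption, not the page from this post). It adds two variants: a regular expression as the attribute value, and the `attrs` dictionary for attribute names such as `data-role` that cannot be written as Python keyword arguments.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = (
    '<body><div id="head" class="wrap">'
    '<a class="nav" href="http://example.com/news">News</a>'
    '<a href="/local">Local</a>'
    '<span data-role="tip">tip</span>'
    '</div></body>'
)
bs = BeautifulSoup(html, "html.parser")

by_id = bs.find_all(id="head")                     # exact attribute value
with_class = bs.find_all(class_=True)              # any tag that has a class
with_href = bs.find_all(href=True)                 # any tag that has an href
absolute = bs.find_all(href=re.compile(r"^http"))  # href matching a pattern
by_data = bs.find_all(attrs={"data-role": "tip"})  # names with hyphens

print(len(by_id), len(with_class), len(with_href), len(absolute), len(by_data))
```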

Searching with the text argument

# Find every string that matches the given text (the whole string must be equal)
t_list = bs.find_all(text="貼")
print(t_list)
# A list of several text values can be passed
t_list = bs.find_all(text=["hao123", "新聞", "貼吧"])
print(t_list)
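A sketch of text matching on made-up markup (an assumption). It uses `string=`, the name this argument has had since Beautiful Soup 4.4.0; `text=` is the older spelling. A plain string must equal the whole text node, while a regular expression can match part of it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = "<body><a>hao123</a><a>News</a><a>Tieba</a><p>Tieba forum</p></body>"
bs = BeautifulSoup(html, "html.parser")

# The results are text nodes, not tags; convert them to plain str for display.
exact = [str(s) for s in bs.find_all(string="Tieba")]
several = [str(s) for s in bs.find_all(string=["hao123", "News"])]
partial = [str(s) for s in bs.find_all(string=re.compile("Tieba"))]

print(exact)
print(several)
print(partial)
```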

CSS selector search: select

# Find every <title>...</title> tag
c_list = bs.select('title')
# Find every tag with class="s-top-more"
c_list = bs.select(".s-top-more")
# Find every tag with id="u1"
c_list = bs.select("#u1")
# Find every div whose class attribute is 'guide-info'
c_list = bs.select("div[class='guide-info']")
# Find by child relationship
c_list = bs.select("head>title")
# Get the text of the tag
msg = c_list[0].get_text()
print(msg)
print(c_list)
# Other selectors, such as sibling selectors, work as well
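The sketch below runs these selectors, plus the sibling selectors mentioned above, on a made-up document. The markup is an assumption for illustration; the class and id names are borrowed from the examples in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = (
    '<html><head><title>Demo page</title></head><body>'
    '<div id="u1" class="guide-info">'
    '<a class="s-top-more" href="/more">More</a>'
    '<span>one</span><span>two</span>'
    '</div></body></html>'
)
bs = BeautifulSoup(html, "html.parser")

by_class = bs.select(".s-top-more")             # class selector
by_id = bs.select("#u1")                        # id selector
by_attr = bs.select("div[class='guide-info']")  # attribute selector
title = bs.select_one("head > title")           # first match of a child selector

later_siblings = bs.select("a ~ span")  # every <span> sibling after the <a>
next_sibling = bs.select("a + span")    # only the <span> directly after the <a>

print(title.get_text())
print([s.get_text() for s in later_siblings], [s.get_text() for s in next_sibling])
```

`select` always returns a list, while `select_one` returns the first match or `None`, which avoids indexing with `[0]`.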
