Web Scraping Basics 2: The BeautifulSoup Library

Installing the BeautifulSoup package

pip3 install beautifulsoup4 --default-timeout=1000

Introduction to BeautifulSoup

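BeautifulSoup is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.

BeautifulSoup supports the HTML parser in the Python standard library as well as several third-party parsers. If nothing extra is installed, it uses Python's built-in parser. The lxml parser is more capable and faster, so lxml is the recommended choice.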
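To make the parser choice concrete, here is a minimal sketch that prefers lxml and falls back to the built-in parser when lxml is not installed. The helper name make_soup is hypothetical and not part of the original post; lxml itself is a separate install (for example pip3 install lxml).

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise use the built-in parser.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        return BeautifulSoup(html, features="lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable
        return BeautifulSoup(html, features="html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.get_text())
```

Passing features explicitly also avoids the warning BeautifulSoup emits when it has to guess a parser.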

1. Extracting plain text from a web page

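import requests
from bs4 import BeautifulSoup

r = requests.get('')  # the URL is omitted in the source
bf = BeautifulSoup(r.text, features='html.parser')
# Return the HTML formatted with standard indentation
bf.prettify()
# Strip the HTML tags and return only the plain text
bf.get_text()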
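Because the URL above is omitted, the same two calls can be tried on an inline HTML string instead. The sketch below uses a made-up snippet of HTML; note that prettify() and get_text() both return strings, so in a script the result has to be printed or stored.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
html = "<html><body><h1>Title</h1><p>First <b>bold</b> line.</p></body></html>"
soup = BeautifulSoup(html, features="html.parser")

# prettify() returns an indented string; it prints nothing by itself
pretty = soup.prettify()

# get_text() concatenates every text node in document order
plain = soup.get_text()

# separator and strip are optional arguments that tidy the result
tidy = soup.get_text(separator=" ", strip=True)

print(plain)  # TitleFirst bold line.
print(tidy)   # Title First bold line.
```

The first result shows why raw get_text() output can look run together: adjacent elements contribute their text with nothing in between.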

2. Extracting content from tags

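bf = BeautifulSoup(r.text, features='html.parser')
# Use select to get every <a> element; the result is a list
bf.select('a')
# Find all elements whose id is title (an id must be prefixed with #)
bf.select('#title')
# Find all elements whose class is link (a class must be prefixed with .)
bf.select('.link')
# Find all span elements with class=mask (an id can be given the same way)
bf.select('span[class=mask]')
# Find all <a> tags inside li elements
bf.select('li a')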
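The selectors above can be checked against a small inline document. This is a sketch on made-up HTML, so the ids, classes, and hrefs are assumptions chosen to match the selectors in the text.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
html = """
<div id="title">News</div>
<ul>
  <li><a class="link" href="/a">First</a></li>
  <li><a class="link" href="/b">Second</a></li>
</ul>
<span class="mask">hidden</span>
<a href="/c">Outside</a>
"""
soup = BeautifulSoup(html, features="html.parser")

all_links = soup.select("a")              # every <a> tag
by_id = soup.select("#title")             # elements whose id is "title"
by_class = soup.select(".link")           # elements whose class includes "link"
masked = soup.select("span[class=mask]")  # <span> elements with class="mask"
nested = soup.select("li a")              # <a> tags that are descendants of <li>

# Each result is a Tag: read attributes by key and text with get_text()
hrefs = [a["href"] for a in nested]
texts = [a.get_text() for a in by_class]

# select_one returns the first match, or None when nothing matches
first = soup.select_one("li a")
print(hrefs, texts, first["href"])
```

Note that 'li a' skips the last link, because that <a> is not inside an li element.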

Trying to extract the text of a web page with BeautifulSoup

1. The compile method

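First, a quick look at the compile method, which is used later on.

The compile function compiles a regular expression and returns a regular expression object that other functions can then use.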

>>> import re
>>> s = re.compile('[a]+')
>>> string = 'aaa1123sass'
>>> parts = s.split(string)
>>> parts
['', '1123s', 'ss']
>>> list2 = s.findall(string)
>>> list2
['aaa', 'a']
# Other functions such as findall or split are called on the compiled object s as s.function(string)
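The same session can be written as a small script. The sketch below also adds sub, which is not in the original post, to show that a compiled pattern exposes the usual re functions as methods.

```python
import re

# 'a+' matches the same strings as '[a]+': one or more consecutive 'a' characters
pattern = re.compile(r"a+")
text = "aaa1123sass"

parts = pattern.split(text)        # ['', '1123s', 'ss']
matches = pattern.findall(text)    # ['aaa', 'a']
replaced = pattern.sub("-", text)  # '-1123s-ss'

# The module-level function gives the same result without a compiled object
same_parts = re.split(r"a+", text)

print(parts, matches, replaced)
```

The leading empty string in parts appears because the text starts with a match, so there is nothing to the left of the first split point.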

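2. Trying to extract plain text from a web page with BeautifulSoup

Let's try using BeautifulSoup to extract the 815 exam syllabus from the 華理 (ECUST) official website. If all you know is how to scrape plain text, it is hard to pick the syllabus out of all the other text on the page. Once you have learned to extract content by tag and similar means, separating it becomes easy. With the code below, all of the text comes out mixed together.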

import requests
import re
from bs4 import BeautifulSoup

headers = {}  # the header values are omitted in the source
r = requests.get('', headers=headers)  # the URL is omitted in the source
r.encoding = 'utf-8'
if r.status_code == 200:
    bf = BeautifulSoup(r.text, features="html.parser")
    # Print the HTML with standard indentation
    # print(bf.prettify())
    # Strip the HTML tags and keep only the plain text
    text = bf.get_text()
    # Use a compiled pattern to collapse runs of newlines so the output is tidier
    pattern = re.compile('[\n]+')
    lines = pattern.split(text)
    # Mode 'w' already empties the file, so no separate truncate() call is needed
    with open('txt/815.txt', 'w', encoding='utf-8') as f:
        for x in lines:
            print(x)
            f.write(x + '\n')
else:
    print('Failed to fetch the page')
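Splitting on runs of newlines still leaves empty strings at the ends and pieces that contain only spaces. As a possible refinement, here is a minimal sketch of a cleanup step; the helper name clean_lines and the sample text are made up for illustration.

```python
import re

BLANK_RUN = re.compile(r"\n+")


def clean_lines(text):
    """Split text on runs of newlines and drop whitespace-only pieces.

    clean_lines is a hypothetical helper, not part of the original script.
    """
    pieces = BLANK_RUN.split(text)
    return [p.strip() for p in pieces if p.strip()]


# Made-up sample resembling the output of get_text() on a real page
sample = "\n\nSyllabus\n\n\n  Chapter 1  \n \nChapter 2\n"
print(clean_lines(sample))  # ['Syllabus', 'Chapter 1', 'Chapter 2']
```

In the script above, the loop could iterate over clean_lines(text) instead of the raw split result.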

The main things to master are how to use select and findall.

import requests
from bs4 import BeautifulSoup

headers = {}  # the header values are omitted in the source
r = requests.get('', headers=headers)  # the URL is omitted in the source
r.encoding = 'utf-8'
bf = BeautifulSoup(r.text, features='html.parser')
title = bf.select('.news-1 a')
for x in bf.select('.news-1 a') + bf.select('.news-2 a'):
    # The news date is not on the current page, so we have to open each sub-page to find it
    # The drawback is that every loop iteration requests another page, which is very slow
    rx = requests.get(x['href'], headers=headers)
    rx.encoding = 'utf-8'
    bfx = BeautifulSoup(rx.text, features='html.parser')
    time = bfx.select('.date')[0].text
    text = x.text
    print(time, x.text, x['href'])
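One fragile spot in the loop is select('.date')[0], which raises IndexError when a sub-page has no element with that class. A minimal sketch of a more defensive lookup follows; the helper name extract_date and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def extract_date(html):
    """Return the text of the first element with class 'date', or None.

    extract_date is a hypothetical helper, not part of the original script.
    """
    soup = BeautifulSoup(html, features="html.parser")
    node = soup.select_one(".date")  # None when nothing matches
    return node.get_text(strip=True) if node is not None else None


# Made-up sub-page snippets used only for illustration
with_date = '<div><span class="date"> 2020-01-01 </span></div>'
without_date = "<div><span>no date here</span></div>"

print(extract_date(with_date))     # 2020-01-01
print(extract_date(without_date))  # None
```

Inside the loop, the call would be extract_date(rx.text), with a check for None before printing.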
