1.6 Web page parser: BeautifulSoup

Introduction to BeautifulSoup

BeautifulSoup in practice

To implement the parser, you can choose among:

1. regular expressions,

2. html.parser,

3. BeautifulSoup,

4. lxml, and others.

Here we choose BeautifulSoup.

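Of these, regular expressions work by fuzzy matching on the page text, while the other three do structured parsing based on the DOM. BeautifulSoup can also use either of the other two (html.parser or lxml) as its underlying parser, which makes it the more versatile choice.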
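To make the "fuzzy matching" point concrete, here is a minimal sketch that uses only the standard-library re module. The HTML snippet, the placeholder URLs, and the pattern are assumptions made up for illustration; the point is that a pattern written for one spelling of an attribute silently misses another.

```python
import re

# Hypothetical snippet for illustration; the URLs are placeholders.
html = (
    '<a href="http://example.com/elsie" class="sister">Elsie</a>\n'
    "<a class='sister' href='http://example.com/lacie'>Lacie</a>"
)

# The page is treated as one flat string. This pattern only recognises
# href values written with double quotes.
pattern = re.compile(r'href="([^"]+)"')
hrefs = pattern.findall(html)

# The second link uses single quotes, so the pattern does not see it.
print(hrefs)
```

A structured parser reads both links the same way, because it works on the parsed attributes rather than on the raw characters.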

Structured parsing follows the W3C's DOM (Document Object Model) standard: the document is parsed level by level as a tree.
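The "level by level" idea can be sketched with the standard-library html.parser. Note that HTMLParser only emits start-tag and end-tag events; the depth counter and the TreePrinter class name are additions for this illustration, and the sketch assumes well-formed markup without void elements such as br.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Records every start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag name) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back up to the parent level


parser = TreePrinter()
parser.feed("<html><body><p><a href='#'>x</a></p></body></html>")

# Print the tags indented by depth to show the tree shape.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

Libraries such as BeautifulSoup build the full tree for you, so you can navigate and search it instead of tracking depth by hand.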

pip install beautifulsoup4

[Figure: usage diagram]

Creating a BeautifulSoup object

from bs4 import BeautifulSoup

# Define a long string that holds the HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="" class="sister" id="link1">Elsie</a>,
<a href="" class="sister" id="link2">Lacie</a> and
<a href="" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# The constructor takes the HTML to parse, the name of the parser to use and,
# for byte input only, from_encoding. html_doc is already a str, so
# from_encoding is omitted here (bs4 would ignore it and emit a warning).
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

Searching for nodes (in the DOM tree)

find and find_all both take the same three kinds of arguments:

1. the HTML tag name 2. the tag's attributes 3. the text inside the tag
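The three kinds of arguments can be sketched as follows. The HTML snippet and its /a and /b paths are assumptions made up for this example, and the string= keyword assumes a reasonably recent bs4 release (older releases spell it text=).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only.
html = (
    '<p class="title">Hello</p>'
    '<a href="/a" class="sister">Elsie</a>'
    '<a href="/b" class="sister">Lacie</a>'
)
soup = BeautifulSoup(html, 'html.parser')

# 1. Search by tag name: every <a> element.
by_name = soup.find_all('a')

# 2. Search by attribute: <a> elements whose href is exactly "/b".
#    (class is a Python keyword, so bs4 uses class_ for that attribute.)
by_attr = soup.find_all('a', href='/b')

# 3. Search by text: <a> elements whose text is exactly "Elsie".
by_text = soup.find_all('a', string='Elsie')

# find takes the same arguments but returns only the first match,
# or None when nothing matches.
first = soup.find('a')
missing = soup.find('div')

print(len(by_name), by_attr[0].get_text(), by_text[0]['href'], first['href'], missing)
```

The filters can be combined in one call, and each of them also accepts a compiled regular expression in place of a plain string.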

Once a node has been matched, how do you print it?

Study the following code carefully!

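# coding: utf-8
import re

from bs4 import BeautifulSoup

# Define a long string that holds the HTML.
# The http://example.com/... URLs are placeholders assumed for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object from the HTML string and the parser name
soup = BeautifulSoup(html_doc, 'html.parser')

print('Extract all links')
links = soup.find_all('a')
for link in links:
    # link['class'] is a list, because class is a multi-valued attribute
    print(link.name, link['href'], link['class'], link.get_text())

print('Get the Lacie link')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Match with a regular expression')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the p tag')
p_node = soup.find('p', class_=re.compile(r"ti"))
print(p_node.name, p_node['class'], p_node.get_text())

print('Find hidden links')
hidden_nodes = soup.find_all(style='display:none;')
print(hidden_nodes)  # the sample HTML has no hidden element, so this prints []
# find_all always returns a list, whether there are zero, one or many matches.
# The list itself has no .name, so read the attributes from each element.
for node in hidden_nodes:
    print(node.name, node.get_text())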
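
A note on hidden links when the number of matches is not known in advance: find_all always returns a list-like ResultSet, even for a single match, so .name and get_text() have to be read from each element rather than from the result as a whole. A minimal sketch, with an HTML snippet invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one visible link and one hidden link.
html = (
    '<a href="/visible">visible</a>'
    '<a href="/hidden" style="display:none;">hidden</a>'
)
soup = BeautifulSoup(html, 'html.parser')

# The result is a list no matter how many elements match.
hidden = soup.find_all(style='display:none;')
for node in hidden:
    print(node.name, node['href'], node.get_text())

# With no match the result is an empty list, so the same loop runs zero times.
none_found = soup.find_all(style='visibility:hidden;')
print(len(none_found))
```

Passing a plain string for style matches the attribute value exactly; a compiled regular expression can be passed instead if the spacing inside the style value may vary.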
