Problems parsing non-standard HTML with BeautifulSoup

BeautifulSoup version: 4.3.2

While searching an HTML page with BeautifulSoup's find_all(), I ran into the following markup:

<a href="/shipin/donghuapian/2012-07-25/23404.html"title="謙謙君子"target="_blank">溫潤如玉</a>

Note that in this markup there is no space between the a tag's href attribute and its title attribute.

I ran the page through BeautifulSoup's diagnostic tool, diagnose (available only in version 4.2 and later):

from bs4.diagnose import diagnose

# Read the problem page and report how each installed parser handles it.
with open('test.html') as f:
    html_doc = f.read()
diagnose(html_doc)

The output showed that the line had been parsed as:

<a href="/shipin/donghuapian/2012-07-25/23404.html"> title="謙謙君子" target="_blank">溫潤如玉</a>

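See the problem? The a tag is malformed: it is closed right after href, so title and target end up outside the tag as plain text. When find_all() reaches this line, any match on title fails.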
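To see which attributes Python's built-in parser actually attaches to the tag, a minimal standard-library sketch like the one below can help. It is only an illustration, not the diagnose tool itself: `AttrCollector` and `SNIPPET` are hypothetical names, and the result may differ between Python versions, since newer releases of `html.parser` are more tolerant of missing whitespace than the one described in this post.

```python
from html.parser import HTMLParser

# The anchor from the post, with no whitespace between the attributes.
SNIPPET = ('<a href="/shipin/donghuapian/2012-07-25/23404.html"'
           'title="謙謙君子"target="_blank">溫潤如玉</a>')


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed(SNIPPET)
collector.close()

for tag, attrs in collector.seen:
    # If "title" is missing here, a title-based search cannot match.
    print(tag, sorted(attrs), "title" in attrs)
```

If the last column prints False, the parser dropped title from the tag, which is the failure described above.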

The cause is that, with no other parser installed, BeautifulSoup falls back to Python's built-in HTML parser, which is not very tolerant of malformed pages.

The fix is to specify a different HTML parser for BeautifulSoup. Several are available; I chose lxml:

sudo pip install lxml

Then pass one extra argument when creating the BeautifulSoup object:

#coding=utf-8
import re
from bs4 import BeautifulSoup

with open('test.html') as f:
    html_doc = f.read()
soup = BeautifulSoup(html_doc, 'lxml')  # Select lxml as the HTML parser.
tags = soup.find_all('a', )

And that solves it.
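A quick way to check the fix on your own machine is to run the same title search with each parser and compare. This is a rough sketch under a few assumptions: `title_matches` and `SNIPPET` are hypothetical names introduced here, the markup is inlined instead of read from test.html, and a parser that is not installed is reported as None. No particular outcome is guaranteed, since it depends on the parser versions installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The anchor from the post, with no whitespace between the attributes.
SNIPPET = ('<a href="/shipin/donghuapian/2012-07-25/23404.html"'
           'title="謙謙君子"target="_blank">溫潤如玉</a>')


def title_matches(markup, parser):
    """Return True if find_all() can locate the <a> tag by its title."""
    soup = BeautifulSoup(markup, parser)
    return len(soup.find_all("a", title="謙謙君子")) > 0


results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = title_matches(SNIPPET, parser)
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed

for parser, found in results.items():
    print(parser, found)
```

Any parser that prints False is one that lost the title attribute on this input; switching to one that prints True is the same remedy as passing 'lxml' above.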
