Introduction to the Beautiful Soup Parsing Tool

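1. There are many HTML parsers, for example:

Parsing tool     Speed      Difficulty
BeautifulSoup    slowest    easiest
lxml             fast       easy
regex            fastest    hardest

2. Link to the official documentation for the Beautiful Soup parsing tool.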
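To make the comparison in item 1 concrete, here is a minimal sketch that pulls the same href values out of one snippet with each of the three tools. The sample markup and the function names (hrefs_with_bs4, hrefs_with_lxml, hrefs_with_regex) are hypothetical and exist only for illustration; the sketch shows the difference in style and does not measure speed.

```python
import re

from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical sample markup, not taken from the page scraped in this post.
HTML = """
<html><body>
  <a class="ulink" href="/movie/1.html">Movie One</a>
  <a class="ulink" href="/movie/2.html">Movie Two</a>
  <a href="/about.html">About</a>
</body></html>
"""


def hrefs_with_bs4(html):
    # Beautiful Soup: navigate a tree of Tag objects
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a")]


def hrefs_with_lxml(html):
    # lxml: parse once, then query with XPath
    tree = etree.HTML(html)
    return [str(href) for href in tree.xpath("//a/@href")]


def hrefs_with_regex(html):
    # Regular expression: no parsing at all, just pattern matching on text.
    # This pattern only handles double-quoted href values.
    return re.findall(r'<a\b[^>]*\bhref="([^"]*)"', html)


if __name__ == "__main__":
    print(hrefs_with_bs4(HTML))
    print(hrefs_with_lxml(HTML))
    print(hrefs_with_regex(HTML))
```

All three return the same list for this input, but the regular expression is the most fragile: it breaks as soon as the markup uses single quotes or unquoted attribute values.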

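The example covers: 2.1 getting all "a" tags; 2.2 getting the second "a" tag; 2.3 getting the "a" tags with class='ulink'; 2.4 getting "a" tags that satisfy several conditions; 2.5 getting the href attribute of every "a" tag; 2.6 getting the plain text. The example is as follows: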

# coding:utf-8
import requests
from lxml import etree
from bs4 import BeautifulSoup
import chardet

# The site's root URL was elided in the source; set it before running.
base_domain = ""
# The request headers were elided in the source; the empty dict is a placeholder.
headers = {}


def get_detailed_urls(url):
    # 1. Fetch the page
    response = requests.get(url, headers=headers)
    # Inspecting the page shows that its declared charset is "gb2312"
    encode_style = chardet.detect(response.content)["encoding"]
    # text = response.content.decode(encode_style, "ignore")
    text = response.content.decode("gbk", "ignore")

    # 2. Parse the text into a tree of elements
    soup = BeautifulSoup(text, "lxml")

    # 2.1 Get all "a" tags
    # all_a = soup.find_all("a")
    # for i in all_a:
    #     print(i)
    #     # each item is a Tag
    #     # print(type(i))
    #     # from bs4.element import Tag

    # 2.2 Get the second "a" tag
    # all_a = soup.find_all("a", limit=2)[1]
    # print(all_a)

    # 2.3 Get the "a" tags with class='ulink'
    # Method 1: the class_ keyword
    # all_a = soup.find_all("a", class_="ulink")
    # Method 2: the attrs dict
    # all_a = soup.find_all("a", attrs={"class": "ulink"})
    # for i in all_a:
    #     print(i)

    # 2.4 Get the "a" tags that satisfy several conditions
    # Method 1
    # all_a = soup.find_all("a", class_="ulink", href="/html/gndy/dyzz/20180605/56940.html")
    # Method 2
    # all_a = soup.find_all("a", attrs={"class": "ulink", "href": "/html/gndy/dyzz/20180605/56940.html"})
    # for i in all_a:
    #     print(i)

    # 2.5 Get the href attribute of every "a" tag
    # all_a = soup.find_all("a")
    # for a in all_a:
    #     # Method 1: subscript access
    #     # href = a["href"]
    #     # print(href)
    #     # Method 2: the attrs property
    #     href = a.attrs["href"]
    #     print(href)

    # 2.6 Get the plain text
    # The attrs filter was elided in the source; the empty dict is a
    # placeholder that matches every "td".
    all_a = soup.find_all("td", attrs={})[1:]
    for a in all_a:
        # Method 1: a.string
        # print(a.string)
        # print("=" * 30)
        # Method 2: a.strings (a generator)
        # infos = a.strings
        # for info in infos:
        #     print(info)
        #     print("=" * 30)
        # Method 2, variant: materialize a.strings as a list
        # infos = list(a.strings)
        # print(infos)
        # Method 3: a.stripped_strings
        # infos = a.stripped_strings
        # for info in infos:
        #     print(info)
        #     print("=" * 30)
        # Method 4: a.get_text()
        infos = a.get_text()
        print(infos)


def spider():
    # 1. Build the URL of each list page
    # url = "/html/gndy/dyzz/index.html"
    base_url = "/html/gndy/dyzz/list_23_{}.html"
    for i in range(1, 8):
        url = base_domain + base_url.format(i)
        get_detailed_urls(url)
        # Stop after the first page while testing
        break


if __name__ == '__main__':
    spider()
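The calls from 2.1 to 2.6 can also be tried offline, without requests or a live site. The following is a minimal sketch under stated assumptions: SAMPLE_HTML, its class names, and its paths are hypothetical stand-ins for the downloaded page, and the built-in "html.parser" is used in place of "lxml" so that only bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><td class="row">header</td></tr>
  <tr><td class="row">
    <a class="ulink" href="/html/a.html">First</a>
    <a class="ulink" href="/html/b.html">Second</a>
    <a href="/html/c.html">Third</a>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 2.1 all "a" tags
all_links = soup.find_all("a")

# 2.2 the second "a" tag: limit=2 stops the search after two matches
second_link = soup.find_all("a", limit=2)[1]

# 2.3 "a" tags with class="ulink", written two equivalent ways
ulinks_kw = soup.find_all("a", class_="ulink")
ulinks_attrs = soup.find_all("a", attrs={"class": "ulink"})

# 2.4 several conditions at once
exact = soup.find_all("a", attrs={"class": "ulink", "href": "/html/b.html"})

# 2.5 the href attribute of every "a" tag
hrefs = [a["href"] for a in all_links]

# 2.6 plain text of the second "td"
cell = soup.find_all("td", attrs={"class": "row"})[1]
single = cell.string                      # None: the cell has several children
pieces = list(cell.stripped_strings)      # each text fragment, whitespace trimmed
joined = cell.get_text(" ", strip=True)   # all fragments joined into one string

if __name__ == "__main__":
    print(hrefs)
    print(single)
    print(pieces)
    print(joined)
```

The text-extraction methods differ mainly in what they return: .string gives a single string only when the tag has exactly one child, .strings and .stripped_strings are generators over every text fragment, and get_text() returns one concatenated string.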
