Introduction to Python Web Crawling

from urllib import request

# Open the target page (the URL goes inside the quotes)
fp = request.urlopen("")
content = fp.read()  # raw response body, as bytes
fp.close()
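
The same request can also be written with a `with` block, so the response is closed even if reading fails. The sketch below is one possible shape rather than the required one: the helper name `fetch`, the 10-second timeout, and the choice to return `None` on failure are all assumptions made for illustration.

```python
from urllib import request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch` and its default timeout are hypothetical, for illustration only.
    """
    try:
        # The with-block closes the response even when read() raises.
        with request.urlopen(url, timeout=timeout) as fp:
            return fp.read()
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError is raised for a malformed URL.
        print('request failed:', exc)
        return None
```

Called with a malformed URL, `fetch` prints the error and returns `None` instead of raising, so the caller can decide whether to skip the page or retry.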

To pull data out of the downloaded page, we need Beautiful Soup, a Python library for extracting data from HTML and XML files.

Install the library:

pip3 install beautifulsoup4

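import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all(name='a'):  # find every <a> tag
    print('attrs:', a.attrs)  # print the tag's attributes
# Use a regular expression to find every tag whose id is "link" plus digits
# (the pattern is assumed from this description)
for a in soup.find_all(attrs={'id': re.compile(r'^link\d+$')}):
    print(a)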
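
When an attribute filter is a compiled regular expression, Beautiful Soup tests each attribute value with the pattern's `search()` method, so an unanchored pattern also matches values that merely contain it. The sketch below uses only the standard `re` module to show the difference; the sample `id` values and the names `loose` and `strict` are made up for illustration.

```python
import re

# Hypothetical id values, as they might appear on tags in a page.
ids = ['link1', 'link22', 'mylink3', 'link', 'link4-footer']

loose = re.compile(r'link\d')      # "link" + a digit anywhere in the value
strict = re.compile(r'^link\d+$')  # the whole value is "link" + digits

# search() is the same test Beautiful Soup applies to each attribute value.
loose_hits = [i for i in ids if loose.search(i)]
strict_hits = [i for i in ids if strict.search(i)]

print(loose_hits)   # ['link1', 'link22', 'mylink3', 'link4-footer']
print(strict_hits)  # ['link1', 'link22']
```

Anchoring with `^` and `$` is the safer choice when only ids of the exact form "link" plus digits are wanted.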

The extracted results can be written to a file or processed further, for example cleaned.
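
As one illustration of cleaning, the sketch below strips whitespace from extracted `href` values, resolves relative links against the page URL, drops non-HTTP links and duplicates, and keeps the rest in their original order. The function name `clean_links`, the sample data, and the base URL are hypothetical; only standard-library calls are used.

```python
from urllib.parse import urljoin, urlparse


def clean_links(hrefs, base_url):
    """Return absolute http(s) links, de-duplicated, in first-seen order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        href = href.strip()
        if not href:
            continue  # skip empty href values
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue  # skip mailto: and other non-HTTP schemes
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Made-up sample input for illustration.
raw = [' /about ', 'https://example.com/about', 'mailto:a@example.com', '', 'news.html']
links = clean_links(raw, 'https://example.com/index.html')
print(links)  # ['https://example.com/about', 'https://example.com/news.html']
```

The cleaned list can then be written out one link per line, without repeated or unusable entries.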

A complete example follows:

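from bs4 import BeautifulSoup
from urllib import request
import re
import chardet
import sys
import io

# Change the default encoding of standard output (UTF-8 is assumed here)
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
fp = request.urlopen("")
html = fp.read()
fp.close()
# Detect the page's encoding
det = chardet.detect(html)
# Decode the page using the detected encoding
soup = BeautifulSoup(html.decode(det['encoding']), 'html.parser')
# Find tags whose href starts with http or https
# (the pattern is assumed from this description)
with open(r'c:\users\van\desktop\test.csv', 'a+') as file:
    for tag in soup.find_all(attrs={'href': re.compile(r'^https?://')}):
        print(tag)
        file.write(tag.attrs['href'] + '\n')  # append the link to the file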
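
`chardet.detect` reports `None` as the encoding when it cannot make a guess, and a wrong guess can make `decode` raise. A defensive decoding step might look like the sketch below. The helper name `decode_page` and the UTF-8 fallback are assumptions, and the detected encoding is passed in as a plain argument so the block runs without `chardet` installed.

```python
def decode_page(raw, encoding, fallback='utf-8'):
    """Decode bytes with the detected encoding, falling back if needed.

    `decode_page` is a hypothetical helper; `encoding` stands in for
    det['encoding'] from chardet.detect(raw), which may be None.
    """
    if encoding:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python does not know.
            pass
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return raw.decode(fallback, errors='replace')


page = '\u4e2d\u6587'.encode('gbk')  # sample bytes for illustration
text = decode_page(page, 'gbk')      # decoded with the detected encoding
rough = decode_page(page, None)      # no guess available: fallback path
```

The fallback trades accuracy for robustness: the crawl keeps going, and any replacement characters in the output show which pages need a closer look.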
