Scraping bilibili with a web crawler

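1. Send a request to the server for a given URL and get the HTML text back.

2. Parse the HTML text and pick out the data you need.

3. Extract the hyperlinks from the HTML and go on to crawl those pages.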
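
Step 3 can be sketched with the standard library alone. This is only a minimal illustration and not the approach used in the rest of this post, which relies on BeautifulSoup. The names LinkCollector and extract_links and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links in html, resolved against base_url."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


# Made-up HTML fragment standing in for a downloaded page.
sample = ('<ul><li><a href="/video/1">one</a></li>'
          '<li><a href="https://example.com/2">two</a></li></ul>')
links = extract_links(sample, 'https://example.com/rank')
print(links)
```

A crawler would push the returned URLs onto its queue of pages still to visit, skipping any it has already seen.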

Crawler rules: to see bilibili's crawler policy, append /robots.txt to the site's address.
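
One way to honor those rules in code is the standard library's urllib.robotparser. The sketch below parses a made-up rule set (not bilibili's actual robots.txt) so it runs offline. The user agent name my-crawler and the example.com URLs are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only; not bilibili's real robots.txt.
sample_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(useragent, url) reports whether the rules allow that URL.
allowed = parser.can_fetch('my-crawler', 'https://example.com/ranking')
blocked = parser.can_fetch('my-crawler', 'https://example.com/private/data')
print(allowed, blocked)
```

Against a live site, calling set_url() with the robots.txt address followed by read() would load the real rules instead of the sample above.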

Find the relevant API.

Control the crawl rate.
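
A simple way to control the rate is to enforce a minimum gap between consecutive requests. The Throttle class below is a hypothetical sketch of that idea; the 0.05 second interval is only there to keep the demo fast, and a real crawl would normally use a much longer one.

```python
import time


class Throttle:
    """Hypothetical helper: enforces a minimum delay between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough that calls are min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.05)
for _ in range(3):
    throttle.wait()  # call this right before each requests.get(...)
```

The clock and sleep parameters exist only so the timing can be replaced in tests; normal use needs just min_interval.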

import requests

url = ''
# Send the network request
response = requests.get(url)
print(response.text)

For details, see the HTML material on w3school.

Use BeautifulSoup to parse the HTML that comes back

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

Extract the matching elements: li tags whose class is rank-item

items = soup.find_all('li', class_='rank-item')

Grab the titles

# Grab the title: tag / attribute / filter
i = 0
for itm in items:
    title = itm.find('a', ).text
    i += 1
    print(str(i) + " " + title)

Grab the other fields

# Tag / attribute / filter: grab the title, the uploader's name and the other fields
for itm in items:
    title = itm.find('a', ).text
    # Score
    score = itm.find('div', ).find('div').text
    # Rank
    rank = itm.find('div', ).text
    # Play count
    play_number = itm.find('span', ).text
    # Uploader (up主)
    up_name = itm.find('div', ).find('a').text
    # Uploader's ID; get() returns the value of the named attribute
    up_id = itm.find_all('a')[2].get('href')[len(''):]
    # Grab the URL
    up_url = itm.find('a', ).get('href')

import csv
import datetime

# Format the timestamp as YYYYMMDD_HHMMSS
now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name = 'bilibili_top100_' + now_str + '.csv'
# Note: set the encoding explicitly. utf-8-sig writes a BOM so that tools
# such as Excel detect UTF-8; without it, multi-byte characters may be misread.
with open(file_name, 'w', newline='', encoding='utf-8-sig') as f:
    # Write the data
    pen = csv.writer(f)
    pen.writerow(Video.csv_title())
    for v in videos:
        pen.writerow(v.to_csv())
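
The CSV snippet above calls csv_title() and to_csv() on a video class that is not shown in this post. The sketch below is one possible shape for it, under the assumption that it simply stores the fields scraped earlier. The class layout, the column names and the sample record are all hypothetical, and the rows are written to an in-memory buffer rather than a file.

```python
import csv
import io


class Video:
    """Hypothetical record for one ranked video; field names are assumptions."""

    def __init__(self, rank, title, score, play_number, up_name, up_id, url):
        self.rank = rank
        self.title = title
        self.score = score
        self.play_number = play_number
        self.up_name = up_name
        self.up_id = up_id
        self.url = url

    @staticmethod
    def csv_title():
        """Header row, in the same order as to_csv()."""
        return ['rank', 'title', 'score', 'play_number', 'up_name', 'up_id', 'url']

    def to_csv(self):
        """One data row for csv.writer.writerow()."""
        return [self.rank, self.title, self.score, self.play_number,
                self.up_name, self.up_id, self.url]


# Made-up sample record standing in for scraped data.
videos = [Video(1, 'Sample title', '100', '10', 'sample_up', '42',
                'https://example.com/video/1')]

buffer = io.StringIO()
pen = csv.writer(buffer)
pen.writerow(Video.csv_title())
for v in videos:
    pen.writerow(v.to_csv())
print(buffer.getvalue())
```

In the scraping loop, each set of extracted fields would be wrapped in one such object and appended to the list before the file is written.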

Network libraries: urllib (older), requests

Parsing library: BeautifulSoup

Framework: Scrapy

JS rendering

Handling IP bans
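
One common mitigation for IP bans is to back off and retry when the server starts refusing requests. The sketch below is only an illustration of that idea, not a method taken from this post. The function name fetch_with_backoff is hypothetical, and treating HTTP 403 and 429 as "blocked" is an assumption that may not match how a given site signals a ban.

```python
import time


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with exponential backoff while blocked.

    fetch is any callable returning (status_code, body). Treating 403 and 429
    as "blocked" is an assumption; adjust it for the site being crawled.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status not in (403, 429):
            return status, body
        if attempt < max_retries:
            # Wait base_delay, then twice that, then four times that, ...
            sleep(base_delay * 2 ** attempt)
    return status, body


# Demo with a fake fetch that is blocked twice and then succeeds.
responses = iter([(429, ''), (429, ''), (200, 'ok')])
delays = []
result = fetch_with_backoff(lambda u: next(responses), 'https://example.com/',
                            sleep=delays.append)
print(result, delays)
```

With requests, fetch could be a small wrapper that calls requests.get(url) and returns the response's status_code and text.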
