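Python Data Scraping Techniques and Practice: Crawler Basics

The First Crawler Application

This code fetches the content of the home page of the Publishing House of Electronics Industry (電子工業出版社).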

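# Import the requests module
import requests
# Define the get_content function
def get_content(url):
    resp = requests.get(url)
    return resp.text
# The "__name__ == '__main__'" check keeps the code below from running when this file is imported by another file
if __name__ == '__main__':
    # Define url; its value is the address of the target page to fetch
    url = ""
    # Call the function and assign its return value to content
    content = get_content(url)
    # Print the first 50 characters of content
    print("First 50 characters:", content[0:50])
    # Print the length of content
    content_len = len(content)
    print("Content length:", content_len)
    # Check whether the content length is at least 40 KB
    if content_len >= 40 * 1024:
        print("Content length is greater than or equal to 40 KB")
    else:
        print("Content length is less than 40 KB")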
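
One detail of the size test is worth checking: len() on resp.text counts characters, not bytes, so comparing it with 40 * 1024 only approximates a 40 KB check when the page contains multi-byte text. The sketch below is a minimal illustration that assumes UTF-8 as the encoding. The function name size_report and the sample string are hypothetical and not part of the original script.

```python
def size_report(content, preview_len=50, threshold=40 * 1024):
    """Return (preview, char_count, byte_count, is_large) for decoded text."""
    # len() on a str counts Unicode characters, not bytes.
    char_count = len(content)
    # Assumption: measure the size as UTF-8 encoded bytes.
    byte_count = len(content.encode("utf-8"))
    return content[:preview_len], char_count, byte_count, byte_count >= threshold


# Stand-in text, not a real download: 20 characters per repetition.
sample = "<html>電子工業出版社</html>" * 2000
preview, chars, size, is_large = size_report(sample)
print("characters:", chars)
print("UTF-8 bytes:", size)
print("at least 40 KB:", is_large)
```

For this sample the character count stays below 40 * 1024 while the byte count exceeds it, so the two checks disagree. With requests, len(resp.content) gives the size of the response body in bytes. The script above compares the character count.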

Output:

First 50 characters:

Dictionary comprehension

urls_d = ".format(i) for i in range(1,11)}

Output:
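
Because the dictionary comprehension above survives only as a fragment, here is a minimal sketch of the technique it appears to illustrate: building a dictionary of page URLs from a range of page numbers. The URL template and the choice of the page number as the key are assumptions made for illustration. The original template and key are not preserved.

```python
# Hypothetical URL template; the real one is not preserved in the text above.
URL_TEMPLATE = "https://example.com/list?page={}"

# Assumption: use the page number as the key and the formatted URL as the value.
urls_d = {i: URL_TEMPLATE.format(i) for i in range(1, 11)}

for page, url in urls_d.items():
    print(page, url)
```

range(1, 11) yields the page numbers 1 through 10, so the dictionary has ten entries.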

The Second Crawler Application

This example combines dictionaries, lists, tuples, sets, loops, exception handling, and file operations.

import requests
# Site name -> URL (entries not shown)
urls_dict = {}
urls_lst = [
    ('Publishing House of Electronics Industry', ''),
    ('xyz', 'www.phei.com.cn'),
    ('Online Bookstore 1', '/module/goods/wssd_index.jsp'),
    ('Online Bookstore 2', '/module/goods/wssd_index.jsp')
]
# Crawl using the dictionary
crawled_urls_for_dict = set()
for ind, name in enumerate(urls_dict.keys()):
    name_url = urls_dict[name]
    if name_url in crawled_urls_for_dict:
        print(ind, name, 'already crawled')
    else:
        try:
            resp = requests.get(name_url)
        except Exception as e:
            # Print only the first 50 characters of the error message
            print(ind, name, ':', str(e)[0:50])
            continue
        content = resp.text
        crawled_urls_for_dict.add(name_url)
        with open('bydict_' + name + '.html', 'w', encoding='utf-8') as f:
            f.write(content)
        print('Crawl finished: {} {}, content length {}'.format(ind, name, len(content)))
for u in crawled_urls_for_dict:
    print(u)
print('-' * 60)
# Crawl using the list
crawled_urls_for_list = set()
for ind, tup in enumerate(urls_lst):
    name = tup[0]
    name_url = tup[1]
    if name_url in crawled_urls_for_list:
        print(ind, name, 'already crawled')
    else:
        try:
            resp = requests.get(name_url)
        except Exception as e:
            print(ind, name, ':', str(e)[0:50])
            continue
        content = resp.text
        crawled_urls_for_list.add(name_url)
        # Use a separate prefix so the list results do not overwrite the dictionary results
        with open('bylist_' + name + '.html', 'w', encoding='utf-8') as f:
            f.write(content)
        print('Crawl finished: {} {}, content length {}'.format(ind, name, len(content)))
for u in crawled_urls_for_list:
    print(u)
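
The core pattern in both loops is the same: remember each fetched URL in a set, skip URLs already seen, and keep going when a request raises. The sketch below isolates that pattern so it can be run without network access. crawl_unique, fake_fetch, and the example.com URLs are hypothetical stand-ins. In the real script the fetch step would be requests.get(url).text.

```python
def crawl_unique(named_urls, fetch):
    """Fetch each distinct URL once; return (results, skipped, errors)."""
    seen = set()
    results, skipped, errors = {}, [], {}
    for name, url in named_urls:
        if url in seen:
            # Same URL under a different name: do not fetch it again.
            skipped.append(name)
            continue
        try:
            content = fetch(url)
        except Exception as exc:
            # Keep only the first 50 characters of the message, as the script does.
            errors[name] = str(exc)[:50]
            continue
        seen.add(url)
        results[name] = len(content)
    return results, skipped, errors


def fake_fetch(url):
    """Stand-in for a real download; rejects URLs without a scheme."""
    if "://" not in url:
        raise ValueError("Invalid URL {!r}: No schema supplied.".format(url))
    return "<html>{}</html>".format(url)


pages = [
    ("home", "https://example.com/"),
    ("bad", "www.example.com"),
    ("shop 1", "https://example.com/shop"),
    ("shop 2", "https://example.com/shop"),
]
results, skipped, errors = crawl_unique(pages, fake_fetch)
print(results)
print(skipped)
print(errors)
```

As in the original loops, a URL is added to the set only after a successful fetch, so a URL that failed once would be tried again if it appeared a second time. The original script was run against the real site.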

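Output:

Crawl finished: 1 Online Bookstore 2, content length 130100

Crawl finished: 2 Publishing House of Electronics Industry, content length 102494

3 Online Bookstore 1 already crawled

4 xyz : Invalid URL 'www.phei.com.cn': No schema supplied.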

/module/zygl/zxzyindex.jsp

/module/goods/wssd_index.jsp
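
The "No schema supplied" error for the xyz entry appears because requests needs an absolute URL that includes a scheme such as http or https. A minimal sketch of checking this before making a request, using only the standard library, is shown below. The helper name is_absolute_http_url and the base address https://example.com/ are assumptions for illustration and not taken from the original script.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical base address, used only to show how a site-relative path is resolved.
BASE_URL = "https://example.com/"


def is_absolute_http_url(url):
    """Return True if url has an http(s) scheme and a host part."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for candidate in ["https://example.com/", "www.phei.com.cn", "/module/goods/wssd_index.jsp"]:
    print(candidate, is_absolute_http_url(candidate))

# A site-relative path can be turned into an absolute URL with urljoin.
print(urljoin(BASE_URL, "/module/goods/wssd_index.jsp"))
```

A bare host name such as www.phei.com.cn is parsed as a path with no scheme, so it fails the check, just as it fails in requests.get.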
