A simple Python crawler for blog posts

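# -*- coding: utf-8 -*-
# Python 2 script; uses BeautifulSoup 3 (the `BeautifulSoup` module).
from BeautifulSoup import BeautifulSoup
import urllib
import os


def file_fiter(title):
    """Remove characters that are illegal in Windows file names."""
    filename = []
    for i in range(len(title)):
        if title[i] not in '<>:"/\\|?*':
            filename.append(title[i])
    return filename


def downblog(title, url):
    path = "f:/blog/" + ''.join(file_fiter(title))
    # Create a directory per category to hold its posts.
    if not os.path.isdir(path):
        os.makedirs(path)
    response = urllib.urlopen(url)
    soup = BeautifulSoup(response.read())
    # Each post sits in its own <div>; pass the attrs filter that identifies it.
    all_contents_div = soup.findAll('div',)
    for content_div in all_contents_div:
        title = content_div.find('h3',).contents[0]
        content = content_div.find('div',).getText()
        # Post titles become file names, so mind the Windows file-name character limits.
        filename = path + "/" + ''.join(file_fiter(title)) + ".txt"
        with open(filename, "w") as f:
            f.write(content.encode('utf-8'))
        #print title
        #print content


response = urllib.urlopen("")  # URL of the blog's home page
soup = BeautifulSoup(response.read())
# The goal is to fetch the blog's posts. The home page has a region (a div)
# that lists the categories; fetching the categories leads to the post pages.
categories_div = soup.find('div',)  # locate the category sidebar
href_labels = categories_div.findAll('a')  # find all <a> tags
hrefs = []  # links to all blog pages
categories_titles = []  # category that each page belongs to
# Pair each category with the link to its page.
for label in href_labels:
    hrefs.append(label['href'])
    categories_titles.append(label.string)
# I have not learned multithreading yet, so fetch every post the slowest way: one page at a time.
for i in range(len(hrefs)):
    downblog(categories_titles[i], hrefs[i])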
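
The final loop fetches one category at a time. One way to overlap the downloads is a small thread pool. This is a sketch in Python 3 using the standard library's `concurrent.futures`. The names `download_all`, `worker`, and `max_workers` are hypothetical and not part of the script above; `worker` stands in for a function with the same signature as `downblog`.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(titles, urls, worker, max_workers=4):
    """Call worker(title, url) for every pair using a pool of threads.

    Returns a list of (title, error) tuples for the calls that raised,
    so one failing page does not stop the rest.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every job first, then collect results in submission order.
        futures = [(title, pool.submit(worker, title, url))
                   for title, url in zip(titles, urls)]
        for title, future in futures:
            try:
                future.result()
            except Exception as exc:  # record the failure and keep going
                failures.append((title, exc))
    return failures
```

Assuming the script has been ported to Python 3, the last two lines could then become `download_all(categories_titles, hrefs, downblog)`. Threads suit this job because the work is mostly waiting on the network.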

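Result: the posts are saved under f:/blog, one folder per category and one .txt file per post. That path is hard-coded in downblog, so change it to store the posts elsewhere.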
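
Since the save location is meant to be changed, it may be easier to pass it in as an argument than to edit the string inside `downblog`. A minimal Python 3 sketch, assuming two hypothetical helpers, `safe_name` and `post_path`, neither of which is in the original script:

```python
import os

# Characters Windows does not allow in file names.
WINDOWS_ILLEGAL = '<>:"/\\|?*'


def safe_name(title):
    """Return title with Windows-illegal file-name characters removed."""
    return ''.join(ch for ch in title if ch not in WINDOWS_ILLEGAL)


def post_path(base_dir, category, title):
    """Create base_dir/<category> if needed and return the .txt path for a post."""
    folder = os.path.join(base_dir, safe_name(category))
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, safe_name(title) + ".txt")
```

Calling `post_path("f:/blog", category, title)` would reproduce the layout used above, while any other `base_dir` moves the whole archive.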
