Beginner-Level Python: Scraping Baidu Baike Entries

Goal: scrape the value fields from the revision history of the angelababy entry.

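# _*_ coding:utf-8 _*_
import urllib
import urllib2
import re

page = 1
# History-page URL prefix; the page number is appended to it.
url = '' + str(page)
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    # Print the raw HTML of the page.
    print response.read()
except urllib2.URLError, e:
    # An HTTPError carries a status code; a plain URLError carries only a reason.
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason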

Running the script prints the entire content of the page. What remains is to extract only the value fields we want.
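The script above is written for Python 2, where `urllib2` exists. As a rough sketch for readers on Python 3, the same fetch-and-report-errors step could look like the following. The helper name `fetch`, the `timeout` default, and the `data:` demo URL are assumptions made for this illustration and are not part of the original script.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        print("HTTP error:", e.code)
    except urllib.error.URLError as e:
        # No usable answer at all (unknown scheme, DNS failure, ...).
        print("Request failed:", e.reason)
    return None


if __name__ == "__main__":
    # A data: URL keeps the demo offline; swap in the real page URL.
    print(fetch("data:text/plain;charset=utf-8,hello"))
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first. This plays the same role as the `hasattr(e, "code")` check in the Python 2 version.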

The items we want to scrape all share the same format, as shown in the figure. The code is as follows:

. .

So we match it with the following regular expression:

pattern = re.compile('.*?.*?.*?.*?', re.S)
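To make the idea concrete, here is a small self-contained sketch of how non-greedy groups and `re.S` work together. The HTML snippet, the class names `time` and `value`, and the pattern are all invented for this example. They are not the real Baidu Baike markup.

```python
import re

# Hypothetical markup, invented for this example; the real page differs.
SAMPLE_HTML = """
<tr>
  <td class="time">2016-01-01</td>
  <td class="value">first edit</td>
</tr>
<tr>
  <td class="time">2016-01-02</td>
  <td class="value">second
edit</td>
</tr>
"""

# Each (.*?) captures as little text as possible up to the next literal tag.
# re.S lets "." match newlines, so a single match may span several lines.
pattern = re.compile(
    r'<td class="time">(.*?)</td>.*?<td class="value">(.*?)</td>', re.S
)

rows = pattern.findall(SAMPLE_HTML)
for when, value in rows:
    print(when, "|", value)
```

With two capture groups, `findall` returns one tuple per match. Without `re.S`, the second row would not match, because its value contains a line break.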

The complete code is as follows:

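# _*_ coding:utf-8 _*_
import urllib
import urllib2
import re

page = 1
# History-page URL prefix; the page number is appended to it.
url = '' + str(page)
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # re.S lets "." match newlines, so a match can span several lines.
    pattern = re.compile('.*?.*?.*?.*?', re.S)
    items = re.findall(pattern, content)
    for item in items:
        print(item)
except urllib2.URLError, e:
    # An HTTPError carries a status code; a plain URLError carries only a reason.
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason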

Running it prints each matched item, one per line.
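The script only fetches page 1, but because the page number is simply appended to the URL, walking through several pages is a small step further. The following Python 3 sketch shows one way this might be done. The function `scrape_pages`, its parameters, the `example.invalid` URLs, and the `<b>` markup are all hypothetical, and the fetcher is passed in so the demo can run without touching the network.

```python
import re


def scrape_pages(base_url, pattern, fetch, first=1, last=3):
    """Collect regex matches from base_url + "1", base_url + "2", ..."""
    items = []
    for page in range(first, last + 1):
        content = fetch(base_url + str(page))
        if content is None:  # the fetcher signals failure with None
            continue
        items.extend(pattern.findall(content))
    return items


if __name__ == "__main__":
    # Stand-in fetcher so the demo runs offline; pages and markup are made up.
    fake_site = {
        "http://example.invalid/history?page=1": "<b>a</b><b>b</b>",
        "http://example.invalid/history?page=2": "<b>c</b>",
    }
    found = scrape_pages(
        "http://example.invalid/history?page=",
        re.compile(r"<b>(.*?)</b>", re.S),
        fake_site.get,
    )
    print(found)
```

In real use, `fetch` would be any function that takes a URL and returns the decoded page text, or `None` when the request fails.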

崔慶才 (Cui Qingcai)'s personal blog
