Chapter 1 Summary: Your First Web Scraper

1. urllib or urllib2?

If you have used the urllib2 library in Python 2.x, you may notice that urllib2 and urllib are not quite the same.

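In Python 3.x, urllib2 was renamed urllib and split into several submodules: urllib.request, urllib.parse, and urllib.error. Most function names are unchanged, but when using the new urllib you need to know which functions moved into which submodule.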

from urllib.request import urlopen

# Put the URL of the page to fetch between the quotes
html = urlopen("")
print(html.read())
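To get a feel for how the submodules divide the work, here is a minimal sketch that touches urllib.parse and urllib.error without making any network request. The example.com URLs are placeholders chosen for illustration, not pages from this chapter.

```python
from urllib.parse import urlparse, urljoin
from urllib.error import HTTPError, URLError

# urllib.parse: split a URL into its components
parts = urlparse("http://example.com/pages/page1.html?lang=en")
print(parts.netloc)  # example.com
print(parts.path)    # /pages/page1.html
print(parts.query)   # lang=en

# urllib.parse: resolve a relative link against the page it appeared on
next_page = urljoin("http://example.com/pages/page1.html", "page2.html")
print(next_page)     # http://example.com/pages/page2.html

# urllib.error: HTTPError is a subclass of URLError
print(issubclass(HTTPError, URLError))  # True
```

urllib.request, the third submodule, is where urlopen itself lives.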

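urllib is part of Python's standard library. It contains functions for requesting data over the network, handling cookies, and even changing metadata such as request headers and the user agent.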
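As a rough illustration of changing request metadata, the sketch below builds a urllib.request.Request with a custom User-Agent header. The URL and the user-agent string "my-scraper/0.1" are made-up values for this example, and nothing is sent over the network.

```python
from urllib.request import Request

# Hypothetical URL and user-agent string, for illustration only
req = Request(
    "http://example.com/",
    headers={"User-Agent": "my-scraper/0.1"},
)

print(req.get_method())              # GET
# Request stores header names capitalized, so look it up as "User-agent"
print(req.get_header("User-agent"))  # my-scraper/0.1

# To actually send it, pass the Request object to urllib.request.urlopen(req)
```

Passing a Request object instead of a bare URL string is what lets you attach headers; urlopen accepts either.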

2. BeautifulSoup: it formats and organizes messy web data by locating HTML tags, and presents the XML structure to us as easy-to-use Python objects.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("")
# Name the parser explicitly; html.parser ships with Python
bsobj = BeautifulSoup(html.read(), "html.parser")
print(bsobj.h1)

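As you can see, the tag we extract from the page is nested two levels deep in the structure of the BeautifulSoup object bsobj (html → body → h1). Even so, when we pull the h1 tag out of the object, we can call it directly: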

bsobj.h1

In fact, all of the following calls produce the same result (html → body → h1):

bsobj.html.body.h1

bsobj.body.h1

bsobj.html.h1

In fact, any node of any HTML (or XML) file can be extracted, as long as there is some markup around or near the information you are after.

The page:

- html → ...
    - head →
        - title → A Useful Page
    - body → Lorem ip...
        - h1 →
        - div → Lorem ipsum dolor...
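The equivalence of those access paths can be tried offline on an inline document shaped like the tree above. This is only a sketch: the markup is typed in by hand, and the h1 text "An Interesting Title" is a placeholder assumed for the demo, since the heading text is not shown in the outline.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page; the h1 text is an assumed placeholder
page = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1>"
    "<div>Lorem ipsum dolor...</div></body></html>"
)
bsobj = BeautifulSoup(page, "html.parser")

print(bsobj.h1)                         # <h1>An Interesting Title</h1>
# Every path reaches the same tag
print(bsobj.html.body.h1 == bsobj.h1)   # True
print(bsobj.body.h1 == bsobj.h1)        # True
print(bsobj.html.h1 == bsobj.h1)        # True
print(bsobj.title.get_text())           # A Useful Page
```

Attribute access such as bsobj.h1 returns the first matching tag anywhere below that point, which is why the shorter paths work.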

3. Exceptions when fetching pages

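For example, two main things can go wrong on the scraper's html = urlopen("") line:

• The page does not exist on the server (or an error occurs while retrieving it)

• The server does not exist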

(1) In the first case the server returns an HTTP error, such as "404 Page Not Found" or "500 Internal Server Error". In all such cases urlopen raises an HTTPError exception, which we can handle like this:

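from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("")
except HTTPError as e:
    print(e)
    # Return a null value, break out, or run some other plan
else:
    # The program continues. Note: if you already return or break in the
    # except block above, the else clause is unnecessary and this code
    # will not run either
    pass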

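If an HTTP error code comes back, the program prints the error and does not run the code under the else clause.

(2) If the server does not exist (the link cannot be opened, or the URL is mistyped), no page comes back at all. Note that in Python 3 urlopen signals this by raising URLError (from urllib.error) rather than by returning None, so a None check only helps if your code catches that error and sets html to None, Python's counterpart to null in other languages. With that in place, a conditional can test whether html is None: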

if html is None:
    print("URL is not found")
else:
    # The program continues
    pass
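Cases (1) and (2) can be folded into one small wrapper. The sketch below is one possible arrangement, not the chapter's own code: safe_open, its opener parameter, and the two fake openers are hypothetical names introduced here so the failure paths can be exercised without a network connection.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def safe_open(url, opener=urlopen):
    """Return the response from opener(url), or None if the fetch fails."""
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all: unknown host, refused connection, etc.
        print("Server could not be reached:", e.reason)
    return None


# Stand-in openers (hypothetical) that simulate both failures offline
def fake_404(url):
    raise HTTPError(url, 404, "Not Found", None, None)


def fake_unreachable(url):
    raise URLError("name or service not known")


for opener in (fake_404, fake_unreachable):
    html = safe_open("http://example.invalid/page.html", opener=opener)
    if html is None:
        print("URL is not found")
```

HTTPError is caught first because it is a subclass of URLError; with the order reversed, the URLError branch would swallow HTTP errors too. In real use you would call safe_open(url) and let it default to urlopen.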

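(3) Of course, even when the page has been fetched from the server successfully, an exception can still occur if its content is not quite what we expected. Whenever you access a tag on a BeautifulSoup object, it is wise to add a check that the tag really exists. If the tag you ask for does not exist, BeautifulSoup returns None. If you then go on to access a child tag of that None object without checking it first, you get an exception:

AttributeError: 'NoneType' object has no attribute 'sometag'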

For example (nonexistenttag is a made-up tag that is not actually in the BeautifulSoup object):

print(bsobj.nonexistenttag) prints None

print(bsobj.nonexistenttag.sometag) raises an exception:

AttributeError: 'NoneType' object has no attribute 'sometag'

So how do we guard against both situations? The simplest way is to check for each of them explicitly:

try:     # case (3): the parent tag may be missing
    badcontent = bsobj.nonexistenttag.sometag
except AttributeError as e:
    print("Tag was not found")
else:    # None check, as in (2)
    if badcontent is None:
        print("Tag was not found")
    else:
        print(badcontent)
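If the same two checks keep recurring, they can be wrapped in a helper that walks a chain of tag names and stops at the first missing one. This is a sketch only: find_path is a hypothetical name, and the sample markup is invented for the demonstration.

```python
from bs4 import BeautifulSoup


def find_path(node, *names):
    """Follow a chain of tag names; return None as soon as one is missing."""
    for name in names:
        if node is None:
            return None
        node = node.find(name)
    return node


# Hypothetical sample markup, used only for this demonstration
sample = "<html><body><h1>An Interesting Title</h1></body></html>"
bsobj = BeautifulSoup(sample, "html.parser")

print(find_path(bsobj, "body", "h1"))                 # <h1>An Interesting Title</h1>
print(find_path(bsobj, "nonexistenttag", "sometag"))  # None
```

Because the helper tests for None before each step, it never reaches the attribute access that would raise AttributeError, so the caller only has one None check left to write.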

Putting all three checks together:

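from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def gettitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:          # exception (1)
        return "The page could not be retrieved"
    try:
        bsobj = BeautifulSoup(html.read(), "html.parser")
        title = bsobj.body.h1
    except AttributeError as e:     # exception (3)
        return "The tag was not found"
    return title

title = gettitle("")
if title is None:                   # None check, as in (2)
    print("Title could not be found")
else:
    print(title)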
