Python Crawler 02: Writing Crawlers with the Built-in urllib Module

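At the time of writing, the latest Python release is 3.5.2.

From Python 3 onward, urllib2 no longer exists as a separate module (import urllib2 fails with an error saying there is no such module). It was merged into urllib as urllib.request and urllib.error.

The urllib package as a whole is split into urllib.request, urllib.parse and urllib.error (there is also urllib.robotparser for robots.txt files).

For example, urllib2.urlopen() became urllib.request.urlopen(),

and urllib2.Request() became urllib.request.Request().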
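As a quick check of the renaming described above, the following sketch confirms that the Python 3 names exist in the standard library. It only inspects the modules and sends no request; the dict name `py3_names` is purely illustrative.

```python
import urllib.error
import urllib.parse
import urllib.request

# Python 2 name      -> Python 3 location
# urllib2.urlopen    -> urllib.request.urlopen
# urllib2.Request    -> urllib.request.Request
# urllib2.URLError   -> urllib.error.URLError
# urllib.urlencode   -> urllib.parse.urlencode
py3_names = {
    "urlopen": urllib.request.urlopen,
    "Request": urllib.request.Request,
    "URLError": urllib.error.URLError,
    "urlencode": urllib.parse.urlencode,
}

# Print the fully qualified Python 3 name of each object
for name, obj in py3_names.items():
    print(name, "->", obj.__module__ + "." + obj.__name__)
```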

1. Using the urllib module

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

-----url: the URL to open

-----data: the data to submit via POST

-----timeout: the timeout for accessing the URL, in seconds
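The timeout parameter is easiest to understand together with error handling. Below is a minimal sketch, assuming a hypothetical helper named `fetch` (not part of urllib) that returns None instead of raising when the request fails or times out.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        # timeout is in seconds and applies to blocking socket operations
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        # URLError covers connection failures (HTTPError is a subclass);
        # socket.timeout can be raised while reading a slow response
        print("request failed:", exc)
        return None
```

A call such as `fetch("http://www.example.com/", timeout=3)` (example.com is only a placeholder) then either returns the page bytes or None, so the caller does not need its own try/except.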

When a page is fetched directly with urlopen() from the urllib.request module, the data comes back as bytes and has to be decoded with decode() to turn it into a str.

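import urllib.request

# Placeholder URL; the original URL is missing from the source
url = "http://www.example.com/"
# Send a request to the given URL; the server's response comes back as a file-like object
response = urllib.request.urlopen(url)
# The file-like object supports the usual Python file methods;
# read() reads the entire content and returns bytes
html = response.read()
html = html.decode('utf-8')
# Return the HTTP status code
print(response.getcode())
# Return the URL the data actually came from, which reveals any redirect
print(response.geturl())
# Return the HTTP headers sent by the server
print(response.info())
# Print the response content
print(html)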
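The code above hard-codes UTF-8. As a hedged sketch, the charset can instead be read from the response's Content-Type header, with UTF-8 used only as a fallback. The helper name `read_text` is hypothetical, and the `data:` URL in the usage line is just an offline stand-in for a real page.

```python
import urllib.request


def read_text(url, default_charset="utf-8"):
    """Fetch url and decode the body using the charset the server declares."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes
        # info() returns the headers; get_content_charset() reads the
        # charset parameter of Content-Type, or returns None if absent
        charset = response.info().get_content_charset() or default_charset
    return raw.decode(charset)  # str


# Offline demonstration: a data: URL carries its own body and content type
print(read_text("data:text/plain;charset=utf-8,hello"))
```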

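The object returned by urlopen() provides these methods: read(), getcode(), geturl() and info().

urllib.request.Request(url, data=None, headers={}, method=None)

Use Request() to wrap the request, then fetch the page with urlopen().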

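import urllib.request

# First step in the crawler vs. anti-crawler game:
# build basic headers information
# (illustrative value; the original dict is missing from the source)
ua_headers = {"User-Agent": "Mozilla/5.0"}
# Construct a request object with urllib.request.Request()
# (placeholder URL; the original URL is missing from the source)
request = urllib.request.Request("http://www.example.com/", headers=ua_headers)
# Send the request to the given URL; the server's response comes back as a file-like object
response = urllib.request.urlopen(request)
# The file-like object supports the usual Python file methods;
# read() reads the entire content and returns bytes
html = response.read()
html = html.decode('utf-8')
# Print the response content
print(html)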

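Data commonly placed in the headers:

- User-Agent: this header can carry the following information: browser name and version, operating system name and version, default language

- Connection: indicates the connection state, i.e. whether the connection stays open after the current request (keep-alive) or is closed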
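To make the header handling concrete, here is a small sketch that builds a Request with custom headers and prints them without sending anything. The header values and the example.com URL are illustrative assumptions, not values from the original post.

```python
import urllib.request

# Illustrative header values (assumptions)
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
}

# example.com is a placeholder URL; building a Request sends nothing
request = urllib.request.Request("http://www.example.com/", headers=headers)

# Headers can also be added after construction
request.add_header("Accept-Language", "en-US")

# Request normalizes header names with str.capitalize(), e.g. "User-agent"
for name, value in request.header_items():
    print(name + ":", value)
```

Because of that normalization, lookups such as `request.get_header("User-agent")` need the capitalized form of the name.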

The data parameter of urlopen() defaults to None. When data is not None, urlopen() submits the request as a POST.

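#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import urllib.request
from urllib import parse

# Placeholder URL and illustrative User-Agent; both are missing from the source
url = "http://www.example.com/"
ua_headers = {"User-Agent": "Mozilla/5.0"}
# Form fields, matching the first/pn/kd values discussed after this example
data = {"first": "true", "pn": 1, "kd": "python"}
# urlencode() returns a str; POST data must be bytes, hence encode()
data = parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url, headers=ua_headers, data=data)
response = urllib.request.urlopen(request)
html = response.read()
html = html.decode('utf-8')
# Print the response content
print(html)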
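The effect of data on the request method can be checked without touching the network. In this sketch the URL is a placeholder and the form field is illustrative; only Request objects are built, nothing is sent.

```python
import urllib.parse
import urllib.request

url = "http://www.example.com/"  # placeholder URL, nothing is sent

# Without data the request method is GET
get_request = urllib.request.Request(url)
print(get_request.get_method())  # GET

# data must be bytes, so the urlencoded str is encoded first
form = urllib.parse.urlencode({"kd": "python"}).encode("utf-8")
post_request = urllib.request.Request(url, data=form)
print(post_request.get_method())  # POST
print(post_request.data)  # b'kd=python'
```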

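urllib.parse.urlencode() converts the data to be submitted into a URL-encoded string of key=value pairs that can be attached to a URL or sent as POST data.

After conversion by urlencode(), data becomes first=true&pn=1&kd=python (the pairs are joined with &, not ?). Attached to a URL as a query string, it takes the form

<url>?first=true&pn=1&kd=python

Where the urlencode method lives:

urllib.parse.urlencode(values)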
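A minimal sketch of what urlencode() produces, using the first, pn and kd parameters from the example above. A list of pairs is used here only to keep the parameter order explicit.

```python
from urllib import parse

# A list of (key, value) pairs keeps the parameter order explicit
values = [("first", "true"), ("pn", 1), ("kd", "python")]

# Non-string values such as 1 are converted with str() before encoding
query = parse.urlencode(values)
print(query)  # first=true&pn=1&kd=python

# For a POST body the str is then encoded to bytes
body = query.encode("utf-8")
print(body)  # b'first=true&pn=1&kd=python'
```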

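Encoding characters

from urllib import parse
# Illustrative value; the original dict is missing from the source
wd = {"wd": "python urllib"}
print(parse.urlencode(wd))

Output

wd=python+urllib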
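Encoding matters most for non-ASCII text such as Chinese keywords. The sketch below uses an illustrative keyword (an assumption, not a value from the original post) to show that quote() and unquote() are inverses and that urlencode() encodes dict values the same way.

```python
from urllib import parse

keyword = "爬蟲"  # illustrative non-ASCII keyword

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character
encoded = parse.quote(keyword)
print(encoded)

# unquote() restores the original text
print(parse.unquote(encoded))

# urlencode() percent-encodes each value in the dict in the same way
print(parse.urlencode({"wd": keyword}))
```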

Usage

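#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import urllib.request
from urllib import parse

# Placeholder host; only the trailing "s" of the original URL survives in the source
url = "http://www.example.com/s"
keyword = input("Enter the string to search for: ")
# The two dict literals are missing from the source; these values are assumptions
wd = {"wd": keyword}
headers = {"User-Agent": "Mozilla/5.0"}
# parse.urlencode() takes a dict as its argument
wd = parse.urlencode(wd)
# Build the complete URL
fullurl = url + "?" + wd
# Build the request
request = urllib.request.Request(fullurl, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))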
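The URL-building step can be tested on its own, without a network call. This sketch wraps it in a hypothetical helper `build_search_url` and then parses the result back with urlsplit() and parse_qs(); the example.com host and the wd parameter name are assumptions.

```python
from urllib import parse


def build_search_url(base_url, keyword):
    """Append keyword to base_url as the wd query parameter."""
    return base_url + "?" + parse.urlencode({"wd": keyword})


full_url = build_search_url("http://www.example.com/s", "python urllib")
print(full_url)  # http://www.example.com/s?wd=python+urllib

# urlsplit() isolates the query string, parse_qs() decodes it again
query = parse.urlsplit(full_url).query
print(parse.parse_qs(query)["wd"][0])  # python urllib
```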

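#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import urllib.request
from urllib import parse

# Illustrative User-Agent; the original headers dict is missing from the source
headers = {"User-Agent": "Mozilla/5.0"}


def loadpage(url, filename):
    '''Send a request to url and return the server's response body.
    url: the URL to crawl
    :return: the response body as bytes
    '''
    request = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(request).read()


def writepage(html, filename):
    '''Write the HTML content to a local file.
    html: the response body returned by the server
    :param html:
    :return:
    '''
    print("Saving " + filename)
    # "with" closes the file automatically; "wb" is used because html is bytes
    with open(filename, "wb") as f:
        f.write(html)
    print("Saved")
    print("-" * 30)


def tiebaspider(url, beginpage, endpage):
    '''Tieba crawler scheduler: builds and processes the URL of each page.
    url: the first part of the Tieba URL
    beginpage: first page
    endpage: last page
    :return:
    '''
    for page in range(beginpage, endpage + 1):
        pn = (page - 1) * 50
        filename = "page_" + str(page) + ".html"
        fullurl = url + "&pn=" + str(pn)
        # print(fullurl)
        html = loadpage(fullurl, filename)
        # print(html)
        writepage(html, filename)
    print("Thanks for using")


if __name__ == "__main__":
    kw = input("Enter the name of the Tieba forum to crawl: ")
    beginpage = int(input("Enter the first page: "))
    endpage = int(input("Enter the last page: "))
    # Placeholder; the original base URL is missing from the source
    url = "http://www.example.com/f?"
    # The dict passed to urlencode() is missing from the source; the "kw" key is assumed
    key = parse.urlencode({"kw": kw})
    fullurl = url + key
    tiebaspider(fullurl, beginpage, endpage)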
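The page-to-offset arithmetic in tiebaspider() is easy to verify in isolation. Here is a sketch, assuming a hypothetical helper `page_urls` and a placeholder base URL, that lists the URLs the crawler would request (50 entries per page, as in the code above).

```python
def page_urls(base_url, begin_page, end_page, page_size=50):
    """Return one list-page URL per page number in [begin_page, end_page]."""
    urls = []
    for page in range(begin_page, end_page + 1):
        pn = (page - 1) * page_size  # page 1 -> 0, page 2 -> 50, ...
        urls.append(base_url + "&pn=" + str(pn))
    return urls


# Placeholder base URL (an assumption); the real one is missing from the source
for u in page_urls("http://www.example.com/f?kw=python", 1, 3):
    print(u)
```

Pages 1 to 3 map to offsets 0, 50 and 100, so an off-by-one in the page range shows up immediately in the printed URLs.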
