Building a Distributed Crawler in 21 Days: The urllib Library (Part 1)

# encoding: utf-8
from urllib import request

res = request.urlopen("")
print(res.readlines())

# Parameters of urlopen:
# def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
#             *, cafile=None, capath=None, cadefault=False, context=None):
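
To see what urlopen returns without depending on any particular site, here is a minimal sketch that opens a data: URL, which urllib handles locally. The URL and its payload are made up for illustration; with a real http(s) address the calls are the same, and passing timeout is usually a sensible precaution.

```python
from urllib import request

# A data: URL carries its own payload, so no network access is needed.
# The payload is made up for illustration.
url = "data:text/plain;charset=utf-8,first%20line%0Asecond%20line"

# urlopen returns a file-like response object; `with` closes it afterwards.
with request.urlopen(url, timeout=10) as res:
    content_type = res.headers.get_content_type()
    lines = res.readlines()  # a list of bytes objects, one per line

print(content_type)
print(lines)
```

Note that readlines() yields bytes, not str, so each line usually needs a .decode() call before further text processing.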

Saving a file from a web page to local disk

# coding: utf-8
from urllib import request

res = request.urlretrieve("", 'cnblog.html')

# Parameters of urlretrieve:
# def urlretrieve(url, filename=None, reporthook=None, data=None):
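
The reporthook parameter in the signature above is easy to overlook. The following sketch, under the assumption of a made-up data: URL and a temporary target file, shows that urlretrieve returns a (filename, headers) pair and calls the hook with (block number, block size, total size).

```python
import os
import tempfile
from urllib import request

# Made-up payload in a data: URL so the example runs offline.
url = "data:text/html;charset=utf-8,%3Ch1%3Ehello%3C%2Fh1%3E"

progress = []

def report(block_num, block_size, total_size):
    # Called once before reading starts and once after each block read.
    progress.append((block_num, block_size, total_size))

# Hypothetical target path inside a fresh temporary directory.
target = os.path.join(tempfile.mkdtemp(), "page.html")
saved_path, headers = request.urlretrieve(url, target, reporthook=report)

with open(saved_path, "rb") as f:
    html = f.read()

print(saved_path)
print(html)
print(progress)
```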

The urlencode function encodes Chinese and other special characters

# The urlencode function
# Simple usage
# from urllib import parse
# data =
# qs = parse.urlencode(data)
# print(qs)    # name=%E5%BE%B7%E7%91%9E%E5%85%8B&age=100
# Real-world example
from urllib import request, parse

url = ""
params =
qs = parse.urlencode(params)
url = url + "?" + qs
res = request.urlopen(url)
print(res.read())
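
As a runnable version of the simple usage, the sketch below assumes a dict that reproduces the query string quoted in the comment above; the dict and the base URL are assumptions, not values taken from the original post.

```python
from urllib import parse

# Assumed input, reconstructed from the output quoted in the comment above.
data = {"name": "德瑞克", "age": 100}

# urlencode percent-encodes the UTF-8 bytes of each value and joins pairs with "&".
qs = parse.urlencode(data)
print(qs)

# Hypothetical base URL, used only to show how the query string is attached.
base_url = "https://example.com/search"
full_url = base_url + "?" + qs
print(full_url)
```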

The parse_qs function decodes URL parameters that have been encoded.
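
One detail worth knowing is that parse_qs maps each key to a list of values, because a key may appear more than once in a query string. A minimal sketch, which also shows parse_qsl as an alternative that returns ordered pairs:

```python
from urllib import parse

qs = "name=%E5%BE%B7%E7%91%9E%E5%85%8B&age=100"

as_dict = parse.parse_qs(qs)    # dict: key -> list of decoded values
as_pairs = parse.parse_qsl(qs)  # list of (key, value) tuples, in order

print(as_dict)
print(as_pairs)
```

All decoded values are strings, so numeric fields such as age still need an explicit int() conversion.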

from urllib import parse

qs = "name=%E5%BE%B7%E7%91%9E%E5%85%8B&age=100"
print(parse.parse_qs(qs))

urlparse and urlsplit both split a URL into its components; the only difference is that the result of urlsplit has no "params" attribute.

# protocol
print('netloc:', result.netloc)  # domain name
print('path:', result.path)  # path
print('query:', result.query)  # query parameters
# Result
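
To make the difference between urlparse and urlsplit concrete, here is a short sketch with a made-up URL that contains a ";params" segment. The URL is an assumption chosen for illustration.

```python
from urllib import parse

# Hypothetical URL with path parameters (";type=a"), a query, and a fragment.
url = "https://www.example.com/s;type=a?wd=python&page=2#top"

parsed = parse.urlparse(url)  # six fields, including params
split = parse.urlsplit(url)   # five fields, no params

print("scheme:", parsed.scheme)
print("netloc:", parsed.netloc)
print("path:", parsed.path)
print("params:", parsed.params)
print("query:", parsed.query)
print("fragment:", parsed.fragment)

# urlsplit leaves the ";type=a" part inside the path instead.
print("urlsplit path:", split.path)
print("urlsplit has params:", hasattr(split, "params"))
```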

Parameters of the Request class

class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):
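
The interplay of data and method is worth a quick check: when method is not given, urllib picks GET if data is None and POST otherwise. The sketch below only constructs Request objects and never sends them; the URL and header value are placeholders.

```python
from urllib import request

url = "https://example.com/api"          # hypothetical endpoint
headers = {"User-Agent": "Mozilla/5.0"}  # illustrative header value

get_req = request.Request(url, headers=headers)
post_req = request.Request(url, data=b"key=value", headers=headers)
put_req = request.Request(url, data=b"key=value", method="PUT")

print(get_req.get_method())   # no data, no method given
print(post_req.get_method())  # data given, no method given
print(put_req.get_method())   # explicit method wins

# Header names are stored capitalized ("User-agent"), so look them up that way.
print(get_req.get_header("User-agent"))
print(get_req.host)
```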

Scraping job listings from Lagou

Lagou's job listings are returned in ajax.json

Using the Request class to scrape Lagou job listings

from urllib import request, parse

url = ""
# Request headers
headers =
# Data to submit with the POST request
data =
# The data of a POST request must be URL-encoded and then converted to bytes
req = request.Request(url, headers=headers, data=parse.urlencode(data).encode('utf-8'), method='POST')  # create a request object
res = request.urlopen(req)
# The response body is bytes and needs to be decoded
print(res.read().decode('utf-8'))
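
The two conversions in this flow (form dict to bytes on the way out, bytes to text on the way back) can be tried without contacting any server. In the sketch below the URL, headers, form fields, and sample response are all made up for illustration and are not Lagou's real values; the request is built but not sent.

```python
import json
from urllib import parse, request

# All values below are hypothetical placeholders.
url = "https://example.com/ajax.json"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.com/jobs",
}
form = {"keyword": "python", "page": 1}

# POST data must be bytes: urlencode to a str first, then encode to UTF-8.
body = parse.urlencode(form).encode("utf-8")
req = request.Request(url, headers=headers, data=body, method="POST")
print(req.get_method(), req.data)


def decode_json(raw: bytes) -> dict:
    """Decode a UTF-8 response body and parse it as JSON."""
    return json.loads(raw.decode("utf-8"))


# Made-up bytes standing in for what res.read() would return.
sample = '{"success": true, "msg": "ok"}'.encode("utf-8")
print(decode_json(sample))
```

Calling request.urlopen(req) would actually send the request; an ajax-style endpoint typically answers with JSON, which is why json.loads is a natural next step after decoding.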

# Using a proxy
from urllib import request

url = ""
# 1. Pass the proxy to ProxyHandler to build a handler
# handler = request.ProxyHandler()
handler = request.ProxyHandler()
# 2. Use the handler to build an opener
opener = request.build_opener(handler)
# 3. Use the opener to send a request
res = opener.open(url)
print(res.read())
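
ProxyHandler takes a dict that maps a URL scheme to a proxy address. The sketch below assumes a hypothetical local proxy at 127.0.0.1:8888 and only builds the opener; nothing is sent.

```python
from urllib import request

# Hypothetical proxy address; replace with a proxy you actually control.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

# 1. Build a handler from the scheme -> proxy mapping.
handler = request.ProxyHandler(proxies)
# 2. Build an opener that uses this handler instead of the default one.
opener = request.build_opener(handler)

print(handler.proxies)
print(type(opener).__name__)

# 3. opener.open(url) would send a request through the proxy.
#    request.install_opener(opener) would make plain urlopen() use it too.
```

Calling ProxyHandler() with no arguments falls back to the proxy settings found in the environment, such as the http_proxy variable.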

