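4chan crawler: step one of crawling, fetching web pages

I. Installing the libraries

The crawler mainly uses Python (string handling and urllib) + selenium + phantomjs + beautifulsoup. You also need to run pip install httplib2.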
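As a quick sanity check after installing, the sketch below reports which of the dependencies named above can be found. The import names are assumptions (beautifulsoup is normally imported as bs4), and check_environment is a hypothetical helper, not part of any of these libraries.

```python
import importlib.util
import shutil

# Assumed import names for the packages listed above.
REQUIRED_MODULES = ["selenium", "bs4", "httplib2"]


def check_environment(modules=REQUIRED_MODULES, binary="phantomjs"):
    """Return a dict mapping each dependency name to True if it was found."""
    status = {
        name: importlib.util.find_spec(name) is not None
        for name in modules
    }
    # PhantomJS is a standalone executable, so look for it on PATH instead.
    status[binary] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

Anything reported as missing still needs to be installed before the crawler can run.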

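Because the example was written for Python 2, some statements raise errors when run under Python 3:

1. import urllib2 fails because the module no longer exists in Python 3; use import urllib.request instead.

2. TypeError: write() argument must be str, not bytes

Cause: in Python 3, open() takes a parameter named encoding, and files opened in text mode are decoded and encoded with it (the default depends on the platform locale and is often 'utf-8'). As a result, read and write operations on a text-mode file handle work only with str instances containing Unicode characters and reject bytes instances containing binary data.

Fix: open the file in binary write mode ('wb') rather than the text write mode ('w') used in the original code.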
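A minimal sketch of the error and the fix, under Python 3. The byte string stands in for a downloaded page, and save_bytes and the temporary file path are hypothetical names chosen for illustration.

```python
import os
import tempfile


def save_bytes(path, payload):
    """Write raw bytes (for example the result of response.read()) to a file."""
    with open(path, "wb") as fh:  # binary mode accepts bytes
        fh.write(payload)


# Stand-in for the bytes a crawler would download.
page = "<html>example page</html>".encode("utf-8")
target = os.path.join(tempfile.mkdtemp(), "page.html")

# Text mode ('w') rejects bytes and reproduces the error described above.
error_message = None
try:
    with open(target, "w") as fh:
        fh.write(page)
except TypeError as exc:
    error_message = str(exc)

# Binary mode ('wb') writes the same payload without complaint.
save_bytes(target, page)
```

If the file needs to stay in text mode instead, decoding the bytes first (for example page.decode("utf-8")) is the usual alternative.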

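Key points:

1. urlopen(url, data, timeout)

url is the address to open, data is the data to send with the request, and timeout is the timeout in seconds (the second and third arguments are optional).

urlopen creates a file-like object representing the remote URL; you then operate on that object as you would a local file to retrieve the remote data.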
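The call can be sketched as follows for Python 3. The fetch helper and its 10-second default timeout are assumptions made for illustration, and the demo uses a data: URL so that it runs without network access; an http or https address is passed in the same way.

```python
from urllib.request import urlopen


def fetch(url, data=None, timeout=10):
    """Open url and return the response body as bytes."""
    # The response is file-like: it supports read() and the with statement.
    with urlopen(url, data, timeout) as response:
        return response.read()


# Offline demo: a data: URL carries its own content.
body = fetch("data:text/plain;charset=utf-8,hello%20crawler")
text = body.decode("utf-8")  # read() returns bytes, so decode before use
print(text)
```

Because read() returns bytes, the result either has to be decoded as shown here or written to a file opened in binary mode.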
