Python: a multithreaded demo for crawling GKGY members

# -*- coding: utf-8 -*-
__author__ = 'wangjingyao'
import urllib
import urllib2
import re
import sys
import threading, Queue, time
import user_agents, random, time
reload(sys)
sys.setdefaultencoding('utf8')  # set the default encoding
_data = []
file_lock = threading.Lock()
share_q = Queue.Queue()  # create a queue with no size limit
_worker_thread_num = 10  # number of worker threads


class mythread(threading.Thread):
    def __init__(self, func):
        super(mythread, self).__init__()  # call the parent class constructor
        self.func = func  # the function this thread will run

    def run(self):
        self.func()


def worker():
    global share_q
    while not share_q.empty():
        url = share_q.get()  # take a task
        my_page = get_page(url)
        getpageitems(my_page)  # get the movie names on the current page
        # write_into_file(temp_data)
        time.sleep(1)
        share_q.task_done()


def get_page(url):
    """Fetch the HTML of the page at the given url.

    args:
        url: the url of the page to crawl
    returns:
        the HTML of the whole fetched page (unicode)
    raises:
        URLError: raised for url errors
    """
    try:
        headers = {}  # header values are missing in the source
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        my_page = response.read().encode('gbk', 'ignore')
        return my_page
    except urllib2.URLError as e:
        if hasattr(e, "code"):
            print "The server couldn't fulfill the request."
            print "Error code: %s" % e.code
            return None
        elif hasattr(e, "reason"):
            print "We failed to reach a server. Please check your url and read the reason"
            print "Reason: %s" % e.reason
            return None


def getpageitems(pagecode):
    """Regex-match the top 100 movie names from the returned page HTML.

    args:
        my_page: the page HTML text to run the regex against
    """
    if not pagecode:
        print 'pagecode init error'
        return None
    # crawl authors (the HTML around the capture group is missing in the source)
    pattern = re.compile('(.*?)')
    items = re.findall(pattern, pagecode)
    for item in items:
        print "authorspider------"
    partterncomment = re.compile('(.*?)')
    itemcomments = re.findall(partterncomment, pagecode)
    for itemcomment in itemcomments:
        if itemcomment.decode('gbk') != '極客漫遊者':
            print "commentspider------"


def main():
    global share_q
    threads = []
    gkgy_url = ""
    # put tasks into the queue; in real use, tasks should be fed in continuously
    for index in xrange(210714, 213394):
        share_q.put(gkgy_url.format(page=index))
    for i in xrange(_worker_thread_num):
        thread = mythread(worker)
        thread.start()  # the thread starts processing tasks
        threads.append(thread)  # keep a reference so join() below has an effect
    for thread in threads:
        thread.join()
    share_q.join()
    _datas = list(set(_data))
    with open("outgkgy.txt", "w+") as my_file:
        for page in _datas:
            my_file.write(page + "\t")
    print "spider successful!!!"


if __name__ == '__main__':
    main()
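
The listing above is Python 2 (urllib2, print statements, xrange), so it will not run on Python 3 as written. Its worker loop also checks `share_q.empty()` before calling `get()`, which can race: one thread may see a non-empty queue, another may take the last task, and the first then blocks forever. Below is a minimal Python 3 sketch of the same queue-and-worker pattern, not the author's code. The names `fetch`, `crawl`, `num_workers` and the `page-N` strings are hypothetical stand-ins, and `fetch` only fakes a download so the example runs without a network. Workers block on `get()` and stop when they receive a sentinel, so tasks can be fed in continuously.

```python
import queue
import threading


def fetch(url):
    """Hypothetical stand-in for a real download; returns fake HTML."""
    return "<html>%s</html>" % url


def crawl(urls, num_workers=4):
    """Process every url with a fixed pool of worker threads."""
    tasks = queue.Queue()            # unbounded task queue
    results = []
    results_lock = threading.Lock()  # guards the shared results list
    stop = object()                  # sentinel that tells a worker to exit

    def worker():
        while True:
            url = tasks.get()        # blocks until a task is available
            if url is stop:
                tasks.task_done()
                break
            try:
                page = fetch(url)
                with results_lock:
                    results.append((url, page))
            finally:
                tasks.task_done()    # always mark the task as finished

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(stop)              # one sentinel per worker
    tasks.join()                     # wait until every task is marked done
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    pages = crawl(["page-%d" % i for i in range(10)])
    print(len(pages))
```

Results arrive in whatever order the workers finish, so sort them if order matters. To crawl real pages, `fetch` would be replaced with an actual HTTP call, for example one built on `urllib.request`.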

