Scraping web pages with a multithreaded Python crawler

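On a whim, I decided to scrape *** data to practise machine-learning analysis on. The site has one page per year. The steps are:
1: Each child thread fetches the page for one year.
2: After a page is fetched, regular expressions extract the data, which is stored in a multidimensional list.
3: Build the SQL statement and store the data in MySQL.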
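Steps 1 and 2 can be sketched without touching the network. This is only an illustration of the thread-per-year pattern: `fetch_page` is a made-up stand-in for the real download, the page text it returns is invented, and an explicit lock is used here to guard the shared list.

```python
import re
import threading

DATE_PATTERN = re.compile(r"[0-9]+/[0-9]+/[0-9]+")

def fetch_page(year):
    # Hypothetical stand-in for the real download: returns invented page text.
    return "%d/01/12 draw A ... %d/01/15 draw B" % (year, year)

def load_year(year, table, lock):
    # Step 2: pull the dates out of the page text with a regular expression.
    rows = [[date] for date in DATE_PATTERN.findall(fetch_page(year))]
    with lock:  # guard the shared list while appending
        table.extend(rows)

def load_all(years):
    table, lock = [], threading.Lock()
    # Step 1: one thread per year.
    threads = [threading.Thread(target=load_year, args=(y, table, lock))
               for y in years]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every year has finished
    return sorted(table)

if __name__ == "__main__":
    for row in load_all(range(2008, 2011)):
        print(row)
```

Sorting at the end matters because the threads finish in no particular order.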

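#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urllib.request import urlopen
import threading
import re
import datetime
import pymysql

table = []  # shared result list; every worker thread appends its rows here

def loadmarksix():
    """Create one worker thread per year (2008-2016); the caller starts them."""
    threads = []
    for year in range(2008, 2017):
        t1 = threading.Thread(target=loadyear, args=(year,))
        threads.append(t1)
    return threads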
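
def loadyear(year):
    """Download one year's page and append one row per draw to `table`."""
    print('%d start' % year)
    url = "" + str(year) + ".htm"  # the base URL is elided here
    html = urlopen(url).read()
    bsobj = BeautifulSoup(html, 'html.parser')
    page = bsobj.get_text()
    # Draw dates, e.g. 2013/01/12
    pattern = r'[0-9]+/[0-9]+/[0-9]+'
    datelist = re.findall(pattern, page)
    # Ball images such as 12.gif
    pattern = r'[0-9]+\.gif'
    codelist = re.findall(pattern, str(html))
    # Five dot-separated Chinese attributes, e.g. 豬.單數.小數.綠波.家畜
    pattern = r'[\u4e00-\u9fa5]+\.[\u4e00-\u9fa5]+\.[\u4e00-\u9fa5]+\.[\u4e00-\u9fa5]+\.[\u4e00-\u9fa5]+'
    summarylist = re.findall(pattern, page)
    # Assumption: seven images per draw (six numbers plus the special number)
    total = min(len(datelist), len(summarylist), len(codelist) // 7)
    for i in range(0, total):
        codes = [int(re.findall('[0-9]+', codelist[7 * i + j])[0]) for j in range(0, 7)]
        codesum = sum(codes[:6])  # sum of the six regular numbers
        info = summarylist[i].split('.')  # e.g. ['猴', '雙數', '小數', '綠波', '野獸']
        # Assumption: this column order is a guess chosen to give the 15 values
        # (after id) that the INSERT statement formats; adjust it to the real table.
        row = [datelist[i]] + codes + [codesum, year]
        row.append(info[0])                         # zodiac animal
        row.append(1 if info[1] == '單數' else 0)   # odd = 1, even = 0
        row.append(1 if info[2] == '大數' else 0)   # big = 1, small = 0
        row.append(1 if info[3] == '紅波' else (2 if info[3] == '藍波' else 3))  # red/blue/green
        row.append(info[4])                         # animal category
        table.append(row)
    print('%d complete\n' % year)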

if __name__ == '__main__':
    threads = loadmarksix()
    for t in threads:
        t.daemon = False
        t.start()
    for t in threads:
        t.join()  # wait for every year to finish before building the SQL
    lottery = sorted(table)
    sql = 'insert into `lottery` values '
    id = 0
    sqlvalue = ''
    for row in lottery:
        id += 1
        sqlvalue += "(%d,'%s',%d,%d,%d,%d,%d,%d,%d,%d,%d,'%s',%d,%d,%d,'%s')," % (id, *row[:15])
    sql += sqlvalue[:-1]  # drop the trailing comma
    begin = datetime.datetime.now()
    # charset must be set, otherwise Chinese text cannot be inserted
    db = pymysql.connect(host="localhost", user="user", password="password",
                         database="marksix", charset="utf8")
    cursor = db.cursor()
    cursor.execute(sql)
    db.commit()
    db.close()
    end = datetime.datetime.now()
    print(end - begin)

In total 1,368 records were inserted, taking 0.01504 s.
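The INSERT above is assembled by string formatting, so quoting is handled by hand. As an alternative (not the approach used in this post), the rows can be passed as parameters and the driver left to do the quoting. A minimal sketch, assuming a hypothetical helper `build_insert` whose result would be handed to a DB-API cursor's `executemany`:

```python
def build_insert(rows, table_name="lottery"):
    """Return (sql, params) suitable for cursor.executemany; ids start at 1."""
    if not rows:
        raise ValueError("no rows to insert")
    width = len(rows[0]) + 1  # +1 for the id column
    placeholders = ", ".join(["%s"] * width)
    sql = "INSERT INTO `%s` VALUES (%s)" % (table_name, placeholders)
    # Prefix every row with its running id.
    params = [(i, *row) for i, row in enumerate(rows, start=1)]
    return sql, params

if __name__ == "__main__":
    sql, params = build_insert([["2013/01/12", 1, "猴"], ["2013/01/15", 2, "豬"]])
    print(sql)
    print(params)
```

With PyMySQL this would be used as `cursor.executemany(sql, params)` followed by `db.commit()`. The table name is still interpolated into the statement, so it must come from trusted code, not user input.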

Generating a .sql file instead and running the insert inside MySQL took 0.000 s.
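The post does not show how the .sql file was produced. One way to do it, sketched under assumptions: `write_sql_file` is a hypothetical helper, and the statement and file name in the demo are invented.

```python
import os
import tempfile

def write_sql_file(sql, path):
    """Write one SQL statement to `path`, terminated by a single semicolon."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(sql.rstrip().rstrip(";") + ";\n")
    return path

if __name__ == "__main__":
    demo_sql = "insert into `lottery` values (1,'2013/01/12')"
    out = write_sql_file(demo_sql, os.path.join(tempfile.gettempdir(), "lottery_demo.sql"))
    print(out)
```

The resulting file can then be loaded from the mysql command-line client, for example with its `source` command.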

There is another way to build the SQL statement:

begin = datetime.datetime.now()
sql = 'insert into `lottery` values '
id = 0
sqlvalue = ''
for row in lottery:
    id += 1
    sqlvalue += '('
    sqlvalue += str(id)
    sqlvalue += ","
    sqlvalue += str(row)[1:-1]  # repr of the row list without its brackets
    sqlvalue += "),"
sql += sqlvalue[:-1]  # drop the trailing comma
end = datetime.datetime.now()
print(end - begin)

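Timing comparison of the two string-building approaches:

%-formatting version: 0:00:00.004010

multi-append version: 0:00:00.007018

When the second version keeps only the single str(...)[1:-1] append per row, the time is 0:00:00.004507.

If the SQL statement is built with + concatenation instead, it takes 0:00:00.037025.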
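Single wall-clock measurements like these are noisy at the millisecond scale. A rough way to repeat the comparison is `timeit`, here also trying `str.join`, which the post does not measure. The rows are made up (1,368 of them, matching the record count above), and timings will differ from machine to machine.

```python
import timeit

# Invented rows standing in for the scraped table.
ROWS = [["2013/01/12", i, i + 1, "x"] for i in range(1368)]

def build_with_append(rows):
    # Grow one string with += inside the loop.
    values = ""
    for n, row in enumerate(rows, start=1):
        values += "(%d,%s)," % (n, str(row)[1:-1])
    return "insert into `lottery` values " + values[:-1]

def build_with_join(rows):
    # Collect the pieces and join them once at the end.
    parts = ("(%d,%s)" % (n, str(row)[1:-1])
             for n, row in enumerate(rows, start=1))
    return "insert into `lottery` values " + ",".join(parts)

if __name__ == "__main__":
    assert build_with_append(ROWS) == build_with_join(ROWS)
    for fn in (build_with_append, build_with_join):
        seconds = timeit.timeit(lambda: fn(ROWS), number=100)
        print(fn.__name__, seconds)
```

Both functions produce the same statement, so only the construction cost differs.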
