Crawler in Practice: Qiushibaike

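With some free time on my hands, I followed an online tutorial, "Crawler in Practice: Qiushibaike", and wrote a Python crawler that scrapes data and stores it. The tutorial scrapes jokes from Qiushibaike and prints them to the console; I changed it to save them to a database instead.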

Enough talk, here is the code:

This is the scraping part:

#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib
import urllib2
import re
import thread
import time
import qsbkdb


class qsbk:
    def __init__(self):
        self.db = qsbkdb.cqsbkdb("database", "user", "password")
        self.db.connect_db()
        self.pageindex = 1
        self.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        # Send a browser-like User-Agent header with every request
        self.headers = {'User-Agent': self.user_agent}

    def __del__(self):
        self.db.close_db()

    def getpage(self, pageindex):
        # Fetch one page and return its HTML as unicode, or None on failure
        try:
            # The page URL prefix is blank in this listing
            url = '' + str(pageindex)
            request = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(request)
            pagecode = response.read().decode('utf-8')
            return pagecode
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print "error", e.reason
            if hasattr(e, "code"):
                print "code", e.code
            return None

    def getpageitems(self, pageindex):
        pagecode = self.getpage(pageindex)
        if not pagecode:
            print "page load error"
            return None
        # The HTML tag literals between the capture groups are missing from this listing
        pattern = re.compile('h2>(.*?)(.*?)(.*?)', re.S)
        items = re.findall(pattern, pagecode)
        pagestories = []
        for item in items:
            # Assumed loop body: keep the three captured fields (author, content, upvotes)
            pagestories.append((item[0].strip(), item[1].strip(), item[2].strip()))
        return pagestories

    def insert_db(self, author, support, context):
        _command = 'insert into jokes(author, support, context) values(\'%s\', %d, \'%s\')' % (author, int(support), context)
        #print _command
        self.db.execute_db(_command)

    def cawler(self):
        # Start from an empty table, then walk pages 1 to 35
        self.db.execute_db('truncate table jokes')
        for index in range(1, 36):
            pagestore = self.getpageitems(index)
            if pagestore is None:
                print "load page " + str(index) + " failure\r\n"
                continue
            for store in pagestore:
                page = index
                try:
                    #print u"page %d\tposted by: %s\t upvotes: %s\n%s\r\n" % (page, store[0], store[2], store[1])
                    self.insert_db(store[0], store[2], store[1])
                except Exception as e:
                    print "reason", e.message
            del pagestore
            time.sleep(1)  # whatever you do, don't remove this
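
To make the extraction step in getpageitems more concrete, here is a minimal sketch, written for Python 3, of how the re.S flag lets a lazy group span several lines. The HTML fragment and its tag and class names are invented for illustration and are not Qiushibaike's actual markup; extract_items is a hypothetical helper.

```python
import re

# Hypothetical markup, invented for this sketch (not the real page structure).
SAMPLE_HTML = """
<div class="item">
<h2>
alice
</h2>
<div class="content">
first line
second line
</div>
<span class="votes">42</span>
</div>
"""

# re.S makes "." match newlines too, so each lazy group can span several lines.
PATTERN = re.compile(
    r'<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>.*?<span class="votes">(\d+)</span>',
    re.S,
)


def extract_items(html):
    """Return a list of (author, content, votes) tuples with whitespace stripped."""
    return [tuple(field.strip() for field in match) for match in PATTERN.findall(html)]


print(extract_items(SAMPLE_HTML))
```

Without re.S the same pattern finds nothing in this sample, because the text inside each tag starts on a new line.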

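The time.sleep(1) call is not there for nothing. Without it, the server may conclude that you are mounting a DDoS attack, because we keep sending requests. Some sites react this way and others do not.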
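
The pacing idea can be sketched on its own, separate from the scraping code. The following is a minimal sketch for Python 3; crawl_pages, fetch, and delay_seconds are hypothetical names, and the fetch function is a stand-in, so no real HTTP request is made.

```python
import time


def crawl_pages(fetch, page_numbers, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(page) for each page, pausing delay_seconds between requests.

    fetch and sleep are passed in so the loop can be exercised without a network.
    """
    results = {}
    for position, page in enumerate(page_numbers):
        if position > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[page] = fetch(page)
    return results


# Usage with a stand-in fetch function (no real HTTP request is made).
pages = crawl_pages(lambda page: "page %d" % page, range(1, 4), delay_seconds=0.01)
print(pages)
```

Unlike the loop above, this sketch pauses between requests rather than after each page, so a failed page does not skip the delay.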

This is the database part:

#!/usr/bin/python
# -*- coding:utf-8 -*-
import MySQLdb


class cqsbkdb:
    def __init__(self, name, user, passwd):
        self.db_name = name
        self.db_usr = user
        self.db_psw = passwd

    def connect_db(self):
        self.db_connect = MySQLdb.connect("localhost", self.db_usr, self.db_psw, self.db_name, charset='utf8')
        self.db_connect.select_db('qsbk')  # the database where you store the Qiushibaike data

    def close_db(self):
        self.db_connect.close()

    def execute_db(self, command):
        _cursor = self.db_connect.cursor()
        try:
            _cursor.execute(command)
            self.db_connect.commit()
        except Exception as e:
            print e.message
            self.db_connect.rollback()
        finally:
            # Close the cursor whether or not the statement succeeded
            _cursor.close()

I use a MySQL database. There are plenty of tutorials on database operations online, so I won't go into detail here.

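First, the table structure.

The table is named jokes.

The key is the insertion number num, set to AUTO_INCREMENT. That is why the insert_db function in qsbk leaves num out: the database assigns it automatically, so we don't have to manage it.

num: joke number (key)

author: author

support: number of upvotes

context: joke text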
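
The table layout and the insert step can be tried out without a MySQL server. The following is a minimal sketch for Python 3 that uses the standard-library sqlite3 module as a stand-in for MySQL, an assumption made only so the example runs anywhere. It also passes the values as query parameters instead of formatting them into the SQL string, so a joke that contains a quote character does not break the statement. insert_joke is a hypothetical helper; sqlite3 uses ? placeholders, whereas MySQLdb uses %s.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jokes (
        num INTEGER PRIMARY KEY AUTOINCREMENT,  -- assigned by the database
        author TEXT,
        support INTEGER,
        context TEXT
    )
    """
)


def insert_joke(connection, author, support, context):
    """Insert one joke; num is left out so the database assigns it."""
    connection.execute(
        "INSERT INTO jokes (author, support, context) VALUES (?, ?, ?)",
        (author, int(support), context),
    )
    connection.commit()


insert_joke(conn, "alice", "42", "It's a joke with a 'quote' in it")
insert_joke(conn, "bob", "7", "second joke")
rows = conn.execute("SELECT num, author, support FROM jokes ORDER BY num").fetchall()
print(rows)
```

With MySQLdb the same idea would be to hand the values to cursor.execute as a second argument rather than building the command string by hand.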

One thing to watch out for: when creating the table, make sure the author and context columns support Chinese text (GBK).

alter table jokes modify context text character set gbk;
alter table jokes modify author text character set gbk;

That is basically it. For now only text jokes are supported; jokes with images are not.
