A Simple Scraper for Qiushibaike

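I have only just started, and I still have a lot of tinkering ahead of me with web scraping. There are many features I have not mastered yet, but getting the data scraped at all made me happy, and I will keep working to improve. I have added comments for the mistakes I made along the way. I am still learning, so let's improve together.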

Along the way I learned a new function, which I want to share here:

strip()

Here is how it is described online:

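Note that the argument is treated as a set of characters, not as a substring: Python removes every matching character from both ends of the string until it reaches a character that is not in the set. For example: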

thestring = 'saaaay yes no yaaaass'

print thestring.strip('say')

Output (one leading and one trailing space remain, because the space is not in the character set):

 yes no

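Here "both ends" refers only to the characters at the two ends of the whole string thestring, that is, saaaay and yaaaass. Every character in these two words that is one of "s", "a", or "y" is removed one by one, working inward from the outside. Stripping stops at the first character that is not in the set (here a space), so the "yes" in the middle is not affected.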

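If no argument is given, leading and trailing whitespace is removed. That is, when rm (the argument to strip) is empty, whitespace characters (including '\n', '\r', '\t', and ' ') are removed by default.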
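The behaviour described above can be checked with a short sketch. It is written for Python 3 (the post itself uses Python 2), and the variable names other than thestring are made up for illustration.

```python
# Python 3 sketch of how str.strip treats its argument as a set of characters.
thestring = 'saaaay yes no yaaaass'

# 's', 'a' and 'y' are removed from both ends until a character outside
# the set (here a space) is reached.
stripped = thestring.strip('say')
print(repr(stripped))

# The order of the characters in the argument does not matter.
same = thestring.strip('yas')

# With no argument, leading and trailing whitespace is removed.
padded = '\t  hello world \r\n'
cleaned = padded.strip()
print(repr(cleaned))
```

Printing with repr makes the spaces that remain at the ends visible.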

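So if you need to get rid of certain unwanted strings, you can first replace them with whitespace using re.sub and then remove that whitespace with strip().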
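A minimal sketch of this two-step cleanup, in Python 3. The sample text and the choice of the <br/> tag as the unwanted string are assumptions made for illustration; the post does not say which strings it had in mind.

```python
import re

# Hypothetical scraped text; the <br/> tags stand in for whatever unwanted
# strings need to be removed (an assumption, not taken from the post).
raw = '\n\n<br/>first line<br/>second line<br/>\n'

# Step 1: replace every <br/> (also written <br> or <br />) with a newline.
no_tags = re.sub(r'<br\s*/?>', '\n', raw)

# Step 2: strip() removes the whitespace left at both ends.
text = no_tags.strip()
print(repr(text))
```

Only the whitespace at the two ends is removed by strip(); the newline between the two lines stays in place.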

import urllib2
import re

url = ''

def get_url(url):
    headers =  # remember the quotes
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read().decode('utf-8')
    return html

def get_info(url):
    html = get_url(url)
    re_info = r'h2\>(.+?)\.+?\(.+?)\.+?\(.+?)\

Next comes the loop. It only takes removing the get_info(url) call at the end and adding a few lines:

def get_all(pages):
    # range(1, pages) stops before pages, so get_all(3) fetches pages 1 and 2
    for i in range(1, pages):
        url = start_url + str(i)
        get_info(url)

get_all(3)
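The page loop builds one URL per page by appending the page number. The Python 3 sketch below isolates just that step; the base URL and the helper names page_urls and page_urls_inclusive are hypothetical. It also shows that range(1, pages) leaves out the last page, in case all pages up to and including pages are wanted.

```python
# Hypothetical base URL, used only for illustration.
start_url = 'http://example.com/page/'

def page_urls(pages):
    """Build URLs the same way as get_all: range(1, pages) excludes pages."""
    return [start_url + str(i) for i in range(1, pages)]

def page_urls_inclusive(pages):
    """Variant that includes the last page by extending the range by one."""
    return [start_url + str(i) for i in range(1, pages + 1)]

print(page_urls(3))
print(page_urls_inclusive(3))
```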

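Next, print the page number along with each joke, and print one joke each time Enter is pressed. Again this takes only a small addition: raw_input is used to check what the user typed.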

def get_info(url, page):
    html = get_url(url)
    re_info = r'h2\>(.+?)\.+?\(.+?)\.+?\(.+?)\

If url is not renamed, the following error is raised:

Traceback (most recent call last):
  File "", line 38, in <module>
    get_all(3)
  File "", line 34, in get_all
    url = url + str(i)
UnboundLocalError: local variable 'url' referenced before assignment
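The error can be reproduced in isolation. In the Python 3 sketch below, the URL and the function names broken and fixed are hypothetical. Because the function assigns to url, Python treats url as a local variable throughout that function, so reading it on the right-hand side fails before the assignment has happened.

```python
url = 'http://example.com/page/'  # hypothetical module-level URL

def broken(i):
    # Assigning to url anywhere in the function makes it local to the
    # function, so the right-hand side reads a local with no value yet.
    url = url + str(i)
    return url

def fixed(i):
    # Using a different local name leaves the outer url readable.
    page_url = url + str(i)
    return page_url

try:
    broken(1)
except UnboundLocalError as exc:
    error_name = type(exc).__name__
    print(error_name)

print(fixed(1))
```

Giving the outer variable and the local variable different names avoids the clash without needing a global statement.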

To tell the two apart, we rename the global url to start_url. The complete code is as follows:

import urllib2
import re

start_url = ''

def get_url(url):
    headers =  # remember the quotes
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read().decode('utf-8')
    return html

def get_info(url, page):
    html = get_url(url)
    re_info = r'h2\>(.+?)\.+?\(.+?)\.+?\(.+?)\

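Argh, after writing this out I found a bug in the loop: you have to type q on every single page before the program stops. The check needs to happen before each page's loop starts, so the function below changes once more, with return used to break out of the whole process.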

def get_info(url, page):
    html = get_url(url)
    re_info = r'h2\>(.+?)\.+?\(.+?)\.+?\(.+?)\
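One way to get the "q quits everything at once" behaviour is sketched below in Python 3, where input replaces Python 2's raw_input. This is not the post's own code: the function names show_jokes and show_all, the read_input parameter, and the sample jokes are all hypothetical. The per-page function returns a flag, and the outer loop stops as soon as that flag reports a quit.

```python
def show_jokes(jokes, page, read_input=input):
    """Print jokes one at a time; return False as soon as the user types q."""
    for joke in jokes:
        # Ask before printing, so a single q ends everything at once.
        if read_input('Press Enter for the next joke, q to quit: ') == 'q':
            return False  # return leaves the loop and the function together
        print('Page %d: %s' % (page, joke))
    return True

def show_all(pages_of_jokes, read_input=input):
    """Walk through the pages; return how many pages were shown in full."""
    shown_pages = 0
    for page, jokes in enumerate(pages_of_jokes, start=1):
        if not show_jokes(jokes, page, read_input):
            return shown_pages
        shown_pages += 1
    return shown_pages

# Simulated keystrokes instead of a real keyboard: Enter, Enter, then q.
answers = iter(['', '', 'q'])
finished = show_all([['joke A', 'joke B'], ['joke C']],
                    read_input=lambda prompt: next(answers))
print(finished)
```

Passing the input function as a parameter is only there so the sketch can run without a keyboard; in interactive use the default input is enough.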
