Scraping a novel with Python

After a few days of study I wrote a simple script that scrapes a novel and tried it out. Sadly, it crawls rather slowly. Here is the code:

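# -*- coding: utf-8 -*-
# Python 2 script (urllib2, print statement, reload(sys)).
import urllib2, urllib
import re
import sys
import random
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf8')


def gethtml(url):
    user_agents = [
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
        'Opera/9.25 (Windows NT 5.1; U; en)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
        'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
        "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 "
    ]
    user_agent_random = random.choice(user_agents)
    # The header value is missing from the source; sending the randomly
    # chosen User-Agent is the evident intent.
    header = {'User-Agent': user_agent_random}
    request = urllib2.Request(url, headers=header)
    html = urllib2.urlopen(request).read()
    # The site serves GBK; 'ignore' skips bytes that cannot be decoded.
    html = html.decode('gbk', 'ignore').encode("utf8")
    # print html
    return html


# Collect the chapter URLs and chapter titles into two lists
def getht(h):
    soup = BeautifulSoup(h, 'html.parser')
    html_ = soup.find_all('dd')
    i = 0
    book = []
    book_mark = []
    for im in html_:
        len1 = len(im)
        s = str(im)
        # The site prefix is missing from the source, so '' is left as-is.
        html_url = '' + s[13:35]
        # The lines that filled the two lists are missing from the source;
        # the two appends below are an assumed reconstruction.
        book.append(html_url)
        book_mark.append(im.get_text())
        i = i + 1
    return book, book_mark


# Extract the page content:
# take the text of one chapter page of the novel and append it to a local file
def getcontent(html_book, html_book_mark):
    soup = BeautifulSoup(html_book, 'html.parser')
    b = soup.find_all('div', id='content')[0]
    fh = open('e://python/2.txt', 'a')
    s = b.get_text()
    st = html_book_mark + str(s) + '\n'
    fh.write(st)
    fh.close()
    print html_book_mark + ' saved successfully'


# Loop over the chapters and save the content of each one
def get_par(books, book_marks):
    for (bo, bo_mark) in zip(books, book_marks):
        getcontent(gethtml(bo), bo_mark)
    # The original check `t > len(books)` could never be true,
    # so the message is printed once the loop has finished.
    print "All chapters saved"


# The site's domain is missing from the source; only the path survives.
url = '/16_16273/'
# Fetch the book's table-of-contents page
html = gethtml(url)
# Get each chapter's URL and title
book, book_m = getht(html)
# Fetch and save the content of each chapter
get_par(book, book_m)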
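
The fixed slice s[13:35] in getht only works while every <dd> entry has exactly the same layout. As a possible alternative, the sketch below (Python 3, standard library only) reads the href attribute and the link text directly instead of counting characters. The names ChapterLinkParser and extract_chapters, the sample HTML, and the base URL http://example.com are all made up for illustration; the real site's markup is not shown in this post, so treat this as a sketch under those assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ChapterLinkParser(HTMLParser):
    """Collect (url, title) pairs from <a> tags nested inside <dd> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.chapters = []   # list of (absolute_url, title)
        self._in_dd = False
        self._href = None    # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self._in_dd = True
        elif tag == "a" and self._in_dd:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # urljoin turns a relative href into an absolute URL.
            url = urljoin(self.base_url, self._href)
            self.chapters.append((url, "".join(self._text).strip()))
            self._href = None
        elif tag == "dd":
            self._in_dd = False


def extract_chapters(html, base_url):
    parser = ChapterLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.chapters


# Hypothetical table-of-contents fragment (assumed markup, not the real site).
sample = (
    '<dl>'
    '<dd><a href="/16_16273/1001.html">Chapter 1</a></dd>'
    '<dd><a href="/16_16273/1002.html">Chapter 2</a></dd>'
    '</dl>'
)
chapters = extract_chapters(sample, "http://example.com/16_16273/")
for url, title in chapters:
    print(title, url)
```

Reading the attribute keeps working when a chapter URL grows by a digit, which a fixed character range does not.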

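Note: during testing, fetching one of the chapters raised the error 'gbk' codec can't decode bytes in position 7782-7783: illegal multibyte sequence. Passing 'ignore' as the error handler to decode resolves it, at the cost of silently dropping the bytes that cannot be decoded.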
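
The effect of the error handler can be reproduced without any network access. The following is a minimal Python 3 sketch: the byte string raw and the helper decode_page are invented for illustration and stand in for a page that contains one byte that is not valid GBK.

```python
# Hypothetical page bytes: valid GBK text, one stray invalid byte, then ASCII.
raw = "第一章".encode("gbk") + b"\xff" + b" end"


def decode_page(data, errors="strict"):
    """Decode GBK bytes; `errors` is passed straight to bytes.decode."""
    return data.decode("gbk", errors)


try:
    decode_page(raw)
    strict_failed = False
except UnicodeDecodeError as exc:
    # This is the same kind of failure the note above describes.
    strict_failed = True
    print("strict decoding failed:", exc.reason)

lenient = decode_page(raw, "ignore")   # undecodable byte is dropped
marked = decode_page(raw, "replace")   # undecodable byte becomes U+FFFD
print(lenient)
print(marked)
```

If it matters where text was lost, 'replace' may be preferable to 'ignore', since it leaves a visible marker at the spot of the bad byte.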

Some user agents:

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 "
]...
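
For readers on Python 3, where urllib2 no longer exists, picking one of these strings per request could look roughly like the sketch below. It uses urllib.request from the standard library; the helper name build_request, the shortened list, and the URL http://example.com are assumptions for illustration, and nothing is actually downloaded.

```python
import random
import urllib.request

# Shortened, illustrative list; any of the strings above could be used.
USER_AGENTS = [
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0",
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a Request whose User-Agent is picked at random from the list."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


# http://example.com is a placeholder, not the site scraped in this post.
req = build_request("http://example.com/16_16273/")
# Request stores header names capitalized, hence "User-agent" in the lookup.
print(req.get_header("User-agent"))
# To actually fetch the page: urllib.request.urlopen(req, timeout=10).read()
```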
