Scraping Zhulang Novels with Python

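I like reading novels online and have long used a reader app that automatically downloads the novels I want to read to local storage, which is quite convenient. I have recently been learning web scraping with Python, and that app gave me the idea of writing a script to scrape novel content for fun. So, after analyzing the page source on Zhulang and working out its structure, I wrote a script that scrapes novel content from Zhulang.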

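The script works as follows: given the URL of a novel's table-of-contents page, it parses that page and extracts each chapter's name and link address, then fetches the chapter content from each link in turn. At this stage it simply starts from the first chapter and fetches one chapter at a time, moving on to the next chapter each time you press Enter. Other sites may be structured differently, so the script would need some modification for them. It has been tested on Zhulang and works normally.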
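The chapter-filtering step described above can be tried in isolation. The following is a minimal sketch, not the author's script: it assumes Python 3 and uses only the standard library (`html.parser` instead of BeautifulSoup), and the names `LinkCollector`, `extract_menu`, and `SAMPLE` are hypothetical. The sample markup is invented for illustration and is not copied from Zhulang.

```python
import re
from html.parser import HTMLParser

# Same idea as the script's pattern: "第", one or more characters, then "章".
CHAPTER_RE = re.compile('\u7b2c.+\u7ae0')


class LinkCollector(HTMLParser):
    """Collect [text, href] pairs for every <a target="_blank"> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('target') == '_blank':
            self._href = attrs.get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append([''.join(self._text).strip(), self._href])
            self._href = None


def extract_menu(html):
    """Return [chapter name, href] pairs whose link text looks like a chapter."""
    collector = LinkCollector()
    collector.feed(html)
    return [pair for pair in collector.links if CHAPTER_RE.search(pair[0])]


# Hypothetical sample markup (an assumption, not the real page structure).
SAMPLE = '''
<a href="/about" target="_blank">關於我們</a>
<a href="/book/1/1.html" target="_blank">第一章 開始</a>
<a href="/book/1/2.html" target="_blank">第二章 旅途</a>
<a href="/book/1/3.html">第三章 歸來</a>
'''

menu = extract_menu(SAMPLE)
for name, href in menu:
    print(name, href)
```

The "about" link is dropped because its text does not match the pattern, and the third chapter link is dropped because it lacks `target="_blank"`, which mirrors the two filters the script relies on.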

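I am sharing this code for two reasons: first, as a record that I can look back on later; second, in the hope that it prompts more experienced developers to offer their advice.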

#!/usr/bin/python
# -*- coding: utf-8 -*-
# python: 2.7.8
# platform: windows
# program: novels from internet
# author: wucl
# description: get novels
# version: 1.0
# history: 2015.5.27 finished extracting the table of contents and URLs
#          2015.5.28 finished regex matching of "第*章" in the table of contents,
#                    extracted the chapter links and downloaded them.
#                    Download tested on Zhulang without errors.

from bs4 import BeautifulSoup
import urllib2, re

def get_menu(url):
    """get chapter name and its url"""
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0"
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req).read()
    soup = BeautifulSoup(page)
    # extract the novel's title (the part of <title> before the first "_")
    novel = soup.find_all('title')[0].text.split('_')[0]
    menu = []
    # collect the elements that hold the chapter names and link addresses
    all_text = soup.find_all('a', target="_blank")
    # match "第..章" in Chinese to filter out unrelated links
    regex = re.compile(ur'\u7b2c.+\u7ae0')
    for title in all_text:
        if re.findall(regex, title.text):
            name = title.text
            x = [name, title['href']]
            # append the [chapter name, link address] pair to the menu list
            menu.append(x)
    return menu, novel

def get_chapter(name, url):
    """get every chapter in menu"""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    # extract the body text of the chapter
    content = soup.find_all('p')
    return content[0].text

if __name__ == "__main__":
    url = raw_input("""input the main page's url of the novel in zhulang\n then press enter to continue\n""")
    if url:
        menu, title = get_menu(url)
        # print the novel's title and the number of chapters found
        print title, str(len(menu)) + '\n press enter to continue \n'
        for i in menu:
            chapter = get_chapter(i[0], i[1])
            raw_input()
            print '\n' + i[0] + '\n'
            print chapter
            print '\n'
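The script above prints one chapter per key press. One possible extension, sketched here under stated assumptions, is to write every chapter to a file-like object instead. This sketch targets Python 3 rather than the script's Python 2.7, and `save_novel`, `fetch_chapter`, and `fake_fetch` are hypothetical names. It also assumes chapter links might be relative, which the original does not say, and resolves them with `urllib.parse.urljoin`; absolute links pass through unchanged.

```python
import io
from urllib.parse import urljoin


def save_novel(menu, base_url, fetch_chapter, out):
    """Write every chapter in `menu` to the file-like object `out`.

    menu          -- list of [chapter name, href] pairs, shaped like get_menu's result
    base_url      -- URL of the table-of-contents page, used to resolve hrefs
    fetch_chapter -- callable taking an absolute URL and returning chapter text
    """
    count = 0
    for name, href in menu:
        # urljoin keeps absolute links as they are and resolves relative ones.
        chapter_url = urljoin(base_url, href)
        out.write(name + '\n\n')
        out.write(fetch_chapter(chapter_url) + '\n\n')
        count += 1
    return count


def fake_fetch(url):
    """Stand-in fetcher so the sketch runs offline; a real one would
    download and parse the page the way get_chapter does."""
    return 'text of ' + url


buffer = io.StringIO()
written = save_novel(
    [['Chapter 1', '/book/1/1.html'],
     ['Chapter 2', 'http://example.com/book/1/2.html']],
    'http://example.com/book/1/',
    fake_fetch,
    buffer,
)
print(written, 'chapters written')
```

Passing the fetcher in as a parameter keeps the file-writing logic testable without network access; in real use `out` could be a file opened with `open(path, 'w', encoding='utf-8')`.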
