Extracting novel content with Python

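The script works as follows: given the URL of a novel's table-of-contents page, it parses that page and extracts each chapter's name and link address. It then fetches the chapter content from each link in turn. For now it simply starts at the first chapter and prints one chapter at a time; pressing Enter moves on to the next chapter. Results may differ for other novels, in which case the script needs some modification. It has been tested and works on Zhulang (逐浪).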

#!/usr/bin/python
# -*- coding: utf8 -*-
# python: 2.7.8
# platform: windows
# program: get novels from internet
# author: wucl
# description: get novels
# version: 1.0
# history: 2015.5.27 finished table-of-contents and URL extraction

from bs4 import BeautifulSoup
import urllib2, re

def get_menu(url):
    """get chapter name and its url"""
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0"
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req).read()
    soup = BeautifulSoup(page)
    novel = soup.find_all('title')[0].text.split('_')[0]  # novel title
    menu = []
    all_text = soup.find_all('a', target="_blank")  # links that carry chapter names and addresses
    regex = re.compile(ur'\u7b2c.+\u7ae0')  # match "第...章" (chapter N) to drop unrelated links
    for title in all_text:
        if re.findall(regex, title.text):
            name = title.text
            x = [name, title['href']]
            menu.append(x)
    return menu, novel

def get_chapter(name, url):
    """get every chapter in menu"""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    content = soup.find_all('p')  # chapter body text
    return content[0].text

if __name__ == "__main__":
    url = raw_input("""input the main page's url of the novel in zhulang\n then press enter to continue\n""")
    if url:
        menu, title = get_menu(url)
        print title, str(len(menu)) + '\n press enter to continue \n'  # novel title and number of chapters found
        for i in menu:
            chapter = get_chapter(i[0], i[1])
            raw_input()  # wait for Enter before showing the next chapter
            print '\n' + i[0] + '\n'
            print chapter
            print '\n'
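The chapter-filtering step can be tried on its own without any network access. The following is only a rough sketch, written for Python 3 with the standard library rather than the Python 2.7 used above. The names filter_chapters and base_url are made up for this illustration and are not part of the original script. Resolving relative links with urljoin is also an addition; the original passes the href along unchanged.

```python
import re
from urllib.parse import urljoin

# Same idea as the script's pattern: "第" followed by anything, then "章".
CHAPTER_RE = re.compile('\u7b2c.+\u7ae0')


def filter_chapters(links, base_url):
    """Keep only the links whose text looks like a chapter title.

    links    -- list of (link_text, href) pairs taken from the contents page
    base_url -- URL of the contents page (hypothetical parameter), used to
                turn relative hrefs into absolute ones
    """
    menu = []
    for text, href in links:
        if CHAPTER_RE.search(text):
            menu.append((text.strip(), urljoin(base_url, href)))
    return menu


if __name__ == "__main__":
    # Made-up sample data; example.com is a placeholder, not a real novel site.
    sample = [
        ('\u7b2c\u4e00\u7ae0 Start', '/book/1/1.html'),
        ('Login', '/login'),
        ('\u7b2c\u4e8c\u7ae0 Road', 'http://example.com/book/1/2.html'),
    ]
    for name, link in filter_chapters(sample, 'http://example.com/book/1/'):
        print(name, link)
```

Keeping the filter separate from the download code makes it easier to adjust the pattern when a site uses a different chapter naming scheme.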
