Python web scraping: it turns out Huaruan News can be scraped this way too!

3. Complete code

4. References

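There are all sorts of ways to write a web scraper. Today we keep it simple and scrape directly with the third-party newspaper library. A few problems came up along the way, since the results differ slightly from one news source to another, so we keep what works and drop what does not. newspaper makes it very convenient to get an article's title and body text, but for jumping to the next article we still fall back on plain XPath.

newspaper is reportedly the third most-starred Python scraping framework on GitHub, and it is well suited to scraping news pages. It is simple to learn and friendly even to beginners who have never written a scraper, because using it does not require thinking about headers, IP proxies, page parsing, or the structure of the page source. That is its strength, but also its weakness: ignoring these things means its requests may be rejected outright. Overall, newspaper is not suited to production news-scraping work. It is unstable and runs into various bugs while scraping, such as failing to retrieve URLs or article content. For anyone who just wants some news text for a corpus, though, it is worth a try: it is simple, convenient, and does not demand much specialist scraping knowledge.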
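One possible workaround for the rejected-request problem described above is to fetch the page yourself with `requests` and custom headers, then hand the HTML to newspaper for parsing only. The following is a minimal sketch, assuming newspaper3k's `Article.download(input_html=...)` parameter. The `User-Agent` value and the helper names `build_headers` and `fetch_article` are illustrative assumptions, not part of the original post.

```python
# Hypothetical header value for illustration only.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; news-corpus-bot)"}


def build_headers(extra=None):
    """Return a copy of the default headers with caller-supplied overrides."""
    headers = dict(DEFAULT_HEADERS)
    headers.update(extra or {})
    return headers


def fetch_article(url, extra_headers=None, timeout=10):
    """Download with requests, then let newspaper parse the supplied HTML."""
    # Third-party imports are kept local so build_headers can be used
    # (and tested) without requests or newspaper3k installed.
    import requests
    from newspaper import Article

    response = requests.get(url, headers=build_headers(extra_headers), timeout=timeout)
    response.raise_for_status()      # raise on 4xx/5xx responses
    response.encoding = "utf-8"

    article = Article(url, language="zh")
    article.download(input_html=response.text)  # skip newspaper's own HTTP request
    article.parse()
    return article.title, article.text
```

Calling `fetch_article(url)` returns a `(title, text)` tuple. Whether custom headers are enough to avoid rejection depends on the site.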

import time
import requests                 # fetch the whole page
from lxml import etree          # extract the address of the next article
from newspaper import Article   # third-party newspaper library for scraping page content

Installation

pip3 install newspaper3k
pip3 install requests
pip3 install lxml

def get_html(url):
    html = Article(url, language='zh')
    html.download()             # fetch the page
    html.parse()                # parse the page
    text = html.text.split()    # drop extra whitespace and other clutter
    content = ''
    for i in text[1:]:          # skip the first chunk, join the rest
        content = content + i
    data = html.title + '\n' + '\n' + content
    print('Scraped successfully!')
    return data
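To make the whitespace handling in `get_html` concrete, here is a small standalone sketch of what `split()` followed by concatenation does to a block of text. The sample string and the helper name `squash_whitespace` are made up for illustration, and the post does not say why the first chunk is dropped, so that step is kept optional here.

```python
def squash_whitespace(text, skip_first=True):
    """Join whitespace-separated chunks, optionally dropping the first one."""
    chunks = text.split()      # splits on any run of spaces, tabs, or newlines
    if skip_first:
        chunks = chunks[1:]    # mirrors text[1:] in get_html
    return "".join(chunks)


sample = "2020-01-01\n\n  第一段。 \n\t第二段。\n"
print(squash_whitespace(sample))  # prints: 第一段。第二段。
```

Joining with an empty string suits Chinese text, which has no spaces between words. For space-delimited languages, `" ".join(chunks)` would be the safer choice.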

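def get_next_url(url):
    headers = {}                # request headers; the values were not preserved in the source
    r = requests.get(url, headers=headers)
    r.raise_for_status()        # raise an error if the request failed
    r.encoding = 'utf-8'
    html = etree.HTML(r.text)   # parse the page with etree
    index_url = ''              # site base URL (empty in the source)
    links = html.xpath("//li[@class='previous']/a/@href")
    if not links:               # no further article to follow
        return ''
    next_url = index_url + links[0]
    return next_url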
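The function above builds the next address by concatenating `index_url` with the extracted `href`. If a site mixes relative and absolute links, `urllib.parse.urljoin` from the standard library is a more forgiving way to combine them. A minimal sketch follows; `https://news.example.com/` and the `7246.html` path are placeholders, since the real base URL is not given in the post, and `resolve_next` is a hypothetical helper.

```python
from urllib.parse import urljoin


def resolve_next(base_url, hrefs):
    """Return the absolute URL for the first extracted href, or '' if none."""
    if not hrefs:
        return ""
    return urljoin(base_url, hrefs[0])


# Placeholder addresses for illustration only.
base = "https://news.example.com/cms/7247.html"
print(resolve_next(base, ["/cms/7246.html"]))  # root-relative href
print(resolve_next(base, ["7246.html"]))       # href relative to the current page
print(resolve_next(base, []))                  # no link found
```

Both non-empty calls resolve to `https://news.example.com/cms/7246.html`, and the empty list yields an empty string, which the caller can treat as "no more articles".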

Complete code

import time
import requests                 # fetch the whole page
from lxml import etree          # extract the address of the next article
from newspaper import Article   # third-party newspaper library for scraping page content


def get_next_url(url):
    headers = {}                # request headers; the values were not preserved in the source
    r = requests.get(url, headers=headers)
    r.raise_for_status()        # raise an error if the request failed
    r.encoding = 'utf-8'
    html = etree.HTML(r.text)   # parse the page with etree
    index_url = ''              # site base URL (empty in the source)
    links = html.xpath("//li[@class='previous']/a/@href")
    if not links:               # no further article to follow
        return ''
    return index_url + links[0]


def get_html(url):
    html = Article(url, language='zh')
    html.download()             # fetch the page
    html.parse()                # parse the page
    text = html.text.split()    # drop extra whitespace and other clutter
    content = ''.join(text[1:]) # skip the first chunk, join the rest
    data = html.title + '\n' + '\n' + content
    print('Scraped successfully!')
    return data


# Save to a txt file
def to_txt(data, name):
    with open('d:\\python\\hr_news\\txt\\' + name, 'w', encoding='utf-8') as f:
        f.write(data)


if __name__ == "__main__":
    url = 'cms/7247.html'       # address of the first article in the site's headline section
    data = get_html(url)
    next_url = get_next_url(url)
    # Scrape 50 articles and name the txt files 01.txt, 02.txt, ...
    for i in range(50):
        name = f'{i + 1:02d}.txt'
        to_txt(data, name)
        print(name + " written successfully")
        time.sleep(1)           # pause between requests
        if next_url:
            data = get_html(next_url)
            next_url = get_next_url(next_url)
        else:
            print("Reached the end")
            break               # stop instead of rewriting the same article
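The numbering logic in the loop (a leading zero for 1 through 9) can also be expressed with a format specifier, and the output directory can be created on demand rather than assumed to exist. The sketch below uses hypothetical helpers `numbered_name` and `save_articles`, with a temporary directory standing in for the `d:\python\hr_news\txt\` folder and made-up article strings.

```python
import tempfile
from pathlib import Path


def numbered_name(index, width=2):
    """Return a zero-padded file name such as '01.txt' or '10.txt'."""
    return f"{index:0{width}d}.txt"


def save_articles(articles, out_dir):
    """Write each article to its own numbered UTF-8 text file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    written = []
    for index, data in enumerate(articles, start=1):
        target = out_path / numbered_name(index)
        target.write_text(data, encoding="utf-8")
        written.append(target.name)
    return written


with tempfile.TemporaryDirectory() as tmp:
    names = save_articles(["標題一\n\n正文一", "標題二\n\n正文二"], Path(tmp) / "txt")
    print(names)  # ['01.txt', '02.txt']
```

Zero-padded names keep the files in scrape order when sorted alphabetically, which is presumably why the original adds the leading zero.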

References

Detailed tutorial on newspaper, a Python package dedicated to scraping news content

The Python newspaper framework
