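Problems
1. Only one item is matched
The cause turned out to be zip(titles, bodies): zip stops at the shorter sequence, so it iterates only as many times as there are titles (once) and never reaches the rest of the bodies.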
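A nested loop was tried instead:
for title in titles:
    for body in bodies:
        yield newsitem(title, wrap(body))
This has its own problem: it produces many (title, body) pairs,
when in fact one title corresponds to one set of bodies made up of many body paragraphs.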
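Alternatively, the exercise may be meant to extract titles and bodies rather than to display a whole news story: a web page carries several news items, and each title is extracted together with its body. But one HTML page has only one title, so that reading does not fit either.
This means the newsitem class itself has to be changed.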
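2. Some Chinese punctuation marks cannot be displayed when printed, for example the comma
3. The NNTP server was not found, so the NNTP source is commented out for now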
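For problem 1, the difference between zip and the nested loop, and one possible shape for a changed item class, can be sketched as follows. The sample data and the groupeditem class are hypothetical and only illustrate the idea; they are not the fix the original code ended up with.

```python
# Hypothetical sample data: one title and several body paragraphs,
# as findall() would return them for a page with a single headline.
titles = ['Headline']
bodies = ['First paragraph.', 'Second paragraph.', 'Third paragraph.']

# zip() stops at the shorter input, so only one pair comes out.
zipped = list(zip(titles, bodies))

# A nested loop pairs the title with every body: three separate pairs.
nested = [(t, b) for t in titles for b in bodies]


class groupeditem(object):
    """Hypothetical variant of newsitem: one title, a list of bodies."""

    def __init__(self, title, bodies):
        self.title = title
        self.bodies = list(bodies)

    @property
    def body(self):
        # Join the paragraphs so destinations can keep reading item.body.
        return '\n'.join(self.bodies)


item = groupeditem(titles[0], bodies)
```

Because groupeditem still exposes title and body, the destinations would not need to change under this assumption; only the source would yield one item per page instead of one per pair.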
News source:
The code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 code (print statements, urllib.urlopen, nntplib).
from nntplib import NNTP
from time import time, strftime, localtime
from email import message_from_string
from urllib import urlopen
import textwrap
import re

day = 24 * 60 * 60  # number of seconds in one day


def wrap(string, max=70):
    # Wrap the text to at most `max` columns.
    return '\n'.join(textwrap.wrap(string, max)) + '\n'


class newagent(object):
    # Collects news items from every source and passes them to every destination.
    def __init__(self):
        self.sources = []
        self.destinations = []

    def addsource(self, source):
        self.sources.append(source)

    def adddestination(self, dest):
        self.destinations.append(dest)

    def distribute(self):
        items = []
        for source in self.sources:
            # Each source is an instance of nntpsource or simplewebsource;
            # getitems() is called through the instance.
            items.extend(source.getitems())
        for dest in self.destinations:
            dest.receiveitems(items)


class newsitem(object):
    # A single news item: a title and a body.
    def __init__(self, title, body):
        self.title = title
        self.body = body


class nntpsource(object):
    # Reads the articles posted to an NNTP group within the last `window` days.
    def __init__(self, servername, group, window):
        self.servername = servername
        self.group = group
        self.window = window

    def getitems(self):
        start = localtime(time() - self.window * day)
        date = strftime('%y%m%d', start)
        hour = strftime('%H%M%S', start)
        server = NNTP(self.servername)
        ids = server.newnews(self.group, date, hour)[1]
        for id in ids:
            lines = server.article(id)[3]
            message = message_from_string('\n'.join(lines))
            title = message['subject']
            body = message.get_payload()
            if message.is_multipart():
                body = body[0]
            yield newsitem(title, body)
        server.quit()


class simplewebsource(object):
    # Extracts titles and bodies from a web page with two regular expressions.
    def __init__(self, url, titlepattern, bodypattern):
        self.url = url
        self.titlepattern = re.compile(titlepattern)
        self.bodypattern = re.compile(bodypattern)

    def getitems(self):
        text = urlopen(self.url).read()
        titles = self.titlepattern.findall(text)
        bodies = self.bodypattern.findall(text)
        for title, body in zip(titles, bodies):
            yield newsitem(title, wrap(body))


class plaindesination(object):
    # Prints every item as plain text.
    def receiveitems(self, items):
        for item in items:
            print item.title
            print '-' * len(item.title)
            print item.body


class htmldeatination(object):
    # Writes every item to an HTML file.
    def __init__(self, filename):
        self.filename = filename

    def receiveitems(self, items):
        # Assumed markup: a linked index (<ul>) followed by one
        # <h2>/<pre> section per item.
        out = open(self.filename, 'w')
        print >> out, '''<html>
<body>'''
        print >> out, '<ul>'
        id = 0
        for item in items:
            id += 1
            print >> out, '<li><a href="#%i">%s</a></li>' % (id, item.title)
        print >> out, '</ul>'
        id = 0
        for item in items:
            id += 1
            print >> out, '<h2><a name="%i">%s</a></h2>' % (id, item.title)
            print >> out, '<pre>%s</pre>' % item.body
        print >> out, '''</body>
</html>'''
        out.close()


def rundefaultsetup():
    agent = newagent()

    # Web source. The URL is blank here, and each capture group still has to
    # be wrapped in the tags that delimit a title or body on the target page.
    _url = ''
    _title = r'(.+?)'
    _body = r'(.+?)'
    bbc = simplewebsource(_url, _title, _body)
    agent.addsource(bbc)

    # NNTP source, commented out until a reachable server is found.
    # clap_server = ''
    # clap_group = ''
    # clap_window = 1
    # clap = nntpsource(clap_server, clap_group, clap_window)
    # agent.addsource(clap)

    agent.adddestination(plaindesination())
    agent.adddestination(htmldeatination('new.html'))
    agent.distribute()


if __name__ == '__main__':
    rundefaultsetup()
Output:
看著眼前這輛19萬買來的寶馬5系車,江西人小王�
��裡那叫乙個開心。如果不出意外,車子開回江西後�
�手一賣,還能再賺個兩三萬。
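The unreadable marks at the line breaks in this output suggest one likely cause of problem 2: in Python 2, urlopen().read() returns a byte string, and textwrap counts bytes, so a line can be cut in the middle of a multi-byte UTF-8 character. The sketch below reproduces that effect and shows that decoding before wrapping avoids it. It assumes the page is UTF-8 encoded, and the sample text is only an excerpt chosen for illustration.

```python
# -*- coding: utf-8 -*-
import textwrap

# Sample text without spaces (an excerpt of the output above).
text = u'看著眼前這輛19萬買來的寶馬5系車'

# Bytes, as Python 2's urlopen().read() would return them (UTF-8 assumed).
raw = text.encode('utf-8')

# Cutting the bytes at a fixed width can land inside a 3-byte character;
# the incomplete character shows up as U+FFFD when decoded for display.
broken = raw[:10].decode('utf-8', 'replace')

# Decoding first and wrapping the unicode text keeps every character whole.
lines = textwrap.wrap(text, 10)
```

Under the same assumption, the web source could decode the downloaded text before calling wrap and encode it again just before printing or writing the file.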