Learning to Write a Simple Python Crawler

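Crawler: a program that automatically fetches information from the internet.

Crawler scheduler: the program's entry point, mainly responsible for controlling the crawl.

URL manager: manages the set of URLs waiting to be crawled and the set of URLs already crawled.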

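The URL manager provides the following functions:

1. Add a new URL to the to-crawl set

2. Check whether a URL being added already exists

3. Check whether any URLs are still waiting to be crawled, and move a URL from the to-crawl set to the crawled set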

URL storage options: Python memory (set() collections), a relational database, or a cache database
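
The URL manager in this post uses the in-memory option. As a rough sketch of the relational-database option, the same pending/crawled bookkeeping could be kept in a single SQLite table. The class name `SqliteUrlStore`, the table `urls`, and the `crawled` column are all hypothetical and not part of the original tutorial; the sketch assumes Python 3 and the standard-library sqlite3 module.

```python
import sqlite3


class SqliteUrlStore(object):
    """Hypothetical URL manager that keeps its state in SQLite."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # crawled = 0 means pending, crawled = 1 means already crawled
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def add_new_url(self, url):
        if url is None:
            return
        # the primary key turns a repeated insert into a no-op
        self.conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.conn.commit()

    def has_new_url(self):
        row = self.conn.execute(
            "SELECT 1 FROM urls WHERE crawled = 0 LIMIT 1"
        ).fetchone()
        return row is not None

    def get_new_url(self):
        row = self.conn.execute(
            "SELECT url FROM urls WHERE crawled = 0 LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # mark the URL as crawled instead of moving it between two sets
        self.conn.execute("UPDATE urls SET crawled = 1 WHERE url = ?", (row[0],))
        self.conn.commit()
        return row[0]
```

Passing a file path instead of ":memory:" would let the crawl state survive a restart, which the in-memory sets cannot do.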

Web page parser: extracts the valuable data from a web page. Implementation options: regular expressions, html.parser, BeautifulSoup, lxml
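
To make the difference between two of these options concrete, here is a small sketch that pulls links out of the same HTML fragment, once with a regular expression and once with the standard-library html.parser. It assumes Python 3; `SAMPLE_HTML`, `LinkCollector`, and the function names are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical fragment used only for this comparison.
SAMPLE_HTML = (
    '<p><a href="/view/1.htm">one</a> and '
    '<a class="x" href="http://example.com/2">two</a></p>'
)


def links_with_regex(html_text):
    # quick but fragile: only matches double-quoted href attributes
    return re.findall(r'<a\s[^>]*href="([^"]*)"', html_text)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_with_html_parser(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def absolute(base_url, links):
    # resolve relative links against the URL of the page they came from
    return [urljoin(base_url, link) for link in links]
```

Both functions return the same two links for this fragment. The regular expression is shorter, but it depends on the exact attribute quoting, while the parser works from the tag structure.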

URL manager:

class urlmanager(object):
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):  # add one new URL to the to-crawl set
        if url is None:
            return
        # skip URLs that are already pending or already crawled
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):  # add a batch of parsed URLs to the to-crawl set
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):  # check whether any URLs are still waiting
        return len(self.new_urls) != 0

    def get_new_url(self):  # take one pending URL and move it to the crawled set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
Web page downloader:

import urllib2

class htmldownloader(object):  # returns the content of the page that url points to
    def download(self, url):
        if url is None:
            return None
        response = urllib2.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()
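
The code above depends on urllib2, which exists only in Python 2. A minimal sketch of the same download step for Python 3, using urllib.request from the standard library, might look like this. The `opener` and `timeout` parameters are additions for illustration and are not part of the original code.

```python
import urllib.request


def download(url, opener=urllib.request.urlopen, timeout=10):
    """Return the body of url as bytes, or None when it cannot be used."""
    if url is None:
        return None
    # opener is injectable so the function can be exercised without a network
    with opener(url, timeout=timeout) as response:
        if response.getcode() != 200:
            return None
        return response.read()
```

The function returns bytes, so the caller still has to decode them or hand them to a parser that detects the encoding.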

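Web page parser:

from bs4 import BeautifulSoup
import re
import urlparse

class htmlparser(object):

    def _get_new_urls(self, page_url, soup):  # collect the URLs contained in the page
        # The body of this method is missing from the source. The lines below are
        # an assumed minimal version that keeps every link; the original most
        # likely filtered the hrefs with a regular expression (hence the re import).
        new_urls = set()
        for link in soup.find_all('a', href=True):
            new_urls.add(urlparse.urljoin(page_url, link['href']))
        return new_urls

    def _get_new_data(self, page_url, soup):  # extract the valuable data from the page
        res_data = {}
        # url
        res_data['url'] = page_url
        # title (class names are kept exactly as they appear in the source)
        title_node = soup.find('dd', class_="lemmawgt-lemmatitle-title")
        res_data['title'] = title_node.get_text()
        # summary
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parser(self, page_url, html_cont):  # call the two helpers to parse the page
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data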

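Data output:

class htmloutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):  # keep every parsed record for the final report
        if data is None:
            return
        self.datas.append(data)

    def outputer_html(self):
        # The HTML tags were stripped from the source; a plain table with
        # one row per record is assumed here.
        fout = open('output.html', 'w')
        fout.write("<html>")
        fout.write("<body>")
        fout.write("<table>")
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'].encode('utf-8'))
            fout.write("<td>%s</td>" % data['title'].encode('utf-8'))
            fout.write("<td>%s</td>" % data['summary'].encode('utf-8'))
            fout.write("</tr>")
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()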
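
The output step writes the collected text straight into the page, so a title containing "<" or "&" would corrupt the markup. As a hedged sketch for Python 3, assuming one table row per record with the url, title, and summary keys used above, the cells could be escaped with the standard-library html module. The function names `render_table` and `write_table` are hypothetical.

```python
import html


def render_table(datas):
    """Build the HTML table as one string, escaping every cell."""
    parts = ["<html>", "<body>", "<table>"]
    for data in datas:
        parts.append("<tr>")
        for key in ("url", "title", "summary"):
            parts.append("<td>%s</td>" % html.escape(data[key]))
        parts.append("</tr>")
    parts.extend(["</table>", "</body>", "</html>"])
    return "".join(parts)


def write_table(datas, path="output.html"):
    # the with block closes the file even if rendering raises
    with open(path, "w", encoding="utf-8") as fout:
        fout.write(render_table(datas))
```

Separating rendering from writing also makes the output easy to check without touching the file system.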

Main crawler scheduler:

from baike_spider import html_downloader, url_manager, html_outputer, html_parser

class spidermain(object):
    def __init__(self):
        self.urls = url_manager.urlmanager()
        self.downloader = html_downloader.htmldownloader()
        self.parser = html_parser.htmlparser()
        self.outputer = html_outputer.htmloutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parser(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 1000:  # stop after 1000 pages
                    break
                count = count + 1
            except Exception:
                # a failed page is reported and skipped; the crawl continues
                print('craw failed')
        self.outputer.outputer_html()

if __name__ == "__main__":
    # root URL to start from (left empty in the source)
    root_url = ""
    obj_spider = spidermain()
    obj_spider.craw(root_url)
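
The scheduler loop can also be tried without a network. The sketch below replaces downloading and parsing with a lookup in a hypothetical in-memory site (`FAKE_SITE`, where each URL maps to the links on that page), so that only the pending/crawled bookkeeping remains. It assumes Python 3, and unlike the loop above its `limit` counts only the pages that were fetched successfully.

```python
# Hypothetical site: each URL maps to the links found on that page.
FAKE_SITE = {
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/a", "http://example.com/c"],
    "http://example.com/c": ["http://example.com/missing"],
}


def craw(root_url, site, limit=1000):
    """Return the URLs visited, in order, following the scheduler loop."""
    new_urls = {root_url}  # waiting to be crawled
    old_urls = set()       # already crawled
    visited = []
    while new_urls and len(visited) < limit:
        url = new_urls.pop()
        old_urls.add(url)
        links = site.get(url)
        if links is None:
            # stands in for a failed download: skip the page and keep going
            continue
        visited.append(url)
        for link in links:
            # same duplicate check as add_new_url
            if link not in new_urls and link not in old_urls:
                new_urls.add(link)
    return visited
```

Pages a and b link to each other, yet each is visited only once, which is the loop-prevention role of the two sets. The visiting order after the root is not fixed, because set.pop() returns an arbitrary element.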
