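Web Page Parser

Learning Tasks

1. Get to know the web page parser

Learning Objectives

Knowledge Objectives

1. Become familiar with web page parsers

Skill Objectives

1. Be able to write a web page parser independently

Function for collecting new URLs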

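from urllib.parse import urljoin  # Python 3 location of urljoin

# Collect the new URLs found on a page
def _get_new_urls(self, page_url, soup):
    # Set that stores the URLs (duplicates are dropped automatically)
    new_urls = set()
    # Get all links. The selection line is missing from the source, so
    # taking every <a> tag that has an href is an assumption.
    for link in soup.find_all('a', href=True):
        # Get each URL
        new_url = link['href']
        # The URL above may be relative and has to be completed:
        # urljoin resolves it against page_url
        new_full_url = urljoin(page_url, new_url)
        new_urls.add(new_full_url)
    return new_urls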
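
To make the joining step concrete, here is a minimal sketch of how `urljoin` resolves different kinds of `href` values against the URL of the page they were found on. The URLs are hypothetical and serve only as an illustration; in Python 3 the function lives in `urllib.parse`.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration.
page_url = "https://example.com/item/python/index.html"

# A path-relative link replaces the last path segment of the page URL.
relative = urljoin(page_url, "other.html")

# A root-relative link keeps only the scheme and host of the page URL.
root_relative = urljoin(page_url, "/docs/page.htm")

# An absolute link is returned unchanged.
absolute = urljoin(page_url, "https://example.org/about")

print(relative)       # https://example.com/item/python/other.html
print(root_relative)  # https://example.com/docs/page.htm
print(absolute)       # https://example.org/about
```

Because the results are stored in a set, the same absolute URL reached through different relative forms is kept only once.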

Function for parsing data

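# Parse the data: we need two items, the title and the summary
def _get_new_data(self, page_url, soup):
    res_data = {}
    res_data['url'] = page_url
    # Match the title node (class names are matched case-sensitively, so
    # they must be spelled exactly as in the target page's markup)
    title_node = soup.find('dd', class_='lemmawgt-lemmatitle-title').find('h1')
    # Get the text of the title
    res_data['title'] = title_node.get_text()
    # Match the summary node and get its text
    summary_node = soup.find('div', class_='lemma-summary')
    res_data['summary'] = summary_node.get_text()
    return res_data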

Function that returns the new URL list and the data

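from bs4 import BeautifulSoup

# Parse two things out of html_cont: the new URL list and the data
def parse(self, page_url, html_cont):
    # Nothing to parse if either argument is missing
    if page_url is None or html_cont is None:
        return None
    soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
    # Collect the new URLs
    new_urls = self._get_new_urls(page_url, soup)
    # Parse out the new data
    new_data = self._get_new_data(page_url, soup)
    return new_urls, new_data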

Task Implementation

Demonstration

Task Implementation 1, Task Implementation 2

Operation Demo

Summary of Key Points

1. What a web page parser is capable of

2. The functions that have to be written for a web page parser

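Questions

1. What is a web page parser?

2. What does a web page parser do?

3. What kinds of web page parsers are there?

Answers

1. A tool that extracts valuable information from web pages.

3. Regular-expression parsing: treats the whole HTML page as a string and applies fuzzy matching to it.

Structured parsing: loads the whole web page document into a DOM tree and parses that tree.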
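
The difference between the two approaches can be sketched with the standard library alone. This is only an illustration under stated assumptions: the HTML fragment is hypothetical and deliberately well-formed, and `xml.etree.ElementTree` stands in for an HTML-aware tree builder such as BeautifulSoup, which also copes with malformed real-world pages.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; one link is commented out.
html = (
    '<div>'
    '<a href="/a.html">A</a>'
    '<!-- <a href="/old.html">old</a> -->'
    '<a href="/b.html">B</a>'
    '</div>'
)

# Regular-expression parsing: the page is just a string, so the pattern
# also matches the link inside the comment.
regex_links = re.findall(r'href="([^"]+)"', html)

# Structured parsing: build a tree first, then walk its <a> elements.
# The comment never becomes an element, so its link is not reported.
root = ET.fromstring(html)
tree_links = [a.get("href") for a in root.iter("a")]

print(regex_links)  # ['/a.html', '/old.html', '/b.html']
print(tree_links)   # ['/a.html', '/b.html']
```

The regular expression is shorter to write, but it has no notion of document structure, which is why it gets harder to keep correct as pages become more complex.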
