Python Crawler (2): Basic Usage of Parsing Libraries

Once we have fetched a web page's response content, we use a parsing library to parse it and filter out the parts we want.

Regular expressions (re)

lxml

Beautiful Soup

pyquery

jsonpath

All of the examples below work on an excerpt of the same response content. Assume the response string is stored in a variable named content.

Using re, Python's built-in regular expression module

For example, to extract the names of all the famous people on the page:

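import re

# The HTML tags inside the original pattern were lost; this pattern is an
# assumed reconstruction that targets the same small.author elements as the
# later examples.
pat = re.compile(r'<small class="author"[^>]*>(.*?)</small>')
print(pat.findall(content))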

Output: ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
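As a self-contained illustration of the same approach, the sketch below runs an author pattern against a small stand-in page. The sample_html string and its tag layout are assumptions made for this example, not the real response.

```python
import re

# Hypothetical stand-in for the real response; the markup is assumed.
sample_html = """
<div class="quote">
  <span class="text">Quote one</span>
  <small class="author" itemprop="author">Albert Einstein</small>
</div>
<div class="quote">
  <span class="text">Quote two</span>
  <small class="author" itemprop="author">Jane Austen</small>
</div>
"""

# The non-greedy group (.*?) captures the text between the opening and
# closing tags. [^>]* tolerates extra attributes on the opening tag, and
# re.S lets "." match newlines in case the text spans several lines.
author_pat = re.compile(r'<small class="author"[^>]*>(.*?)</small>', re.S)

names = [name.strip() for name in author_pat.findall(sample_html)]
print(names)
```

Regular expressions work for simple, regular markup like this, but they are sensitive to small changes in the HTML, which is one reason the structure-aware libraries below are usually more robust.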

lxml supports parsing with XPath. So what is XPath parsing?

XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path or a sequence of steps.
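To make "paths" and "steps" concrete, here is a minimal sketch that runs without installing anything. It uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, rather than lxml; the sample_xml document is a hypothetical, well-formed stand-in written for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in document (assumed structure).
sample_xml = """
<html>
  <body>
    <div class="quote">
      <small class="author">Albert Einstein</small>
    </div>
    <div class="quote">
      <small class="author">Jane Austen</small>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(sample_xml)

# A path made of explicit steps: each "/" moves one level down the tree.
by_steps = [el.text for el in root.findall("./body/div/small")]

# ".//" searches all descendants; [@class='author'] filters on an attribute.
by_descent = [el.text for el in root.findall(".//small[@class='author']")]

print(by_steps)
print(by_descent)
```

Both expressions select the same two nodes here. The second form is usually more convenient because it does not depend on the exact nesting above the target element.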

Using the same example as above, first install the lxml library (pip install lxml).

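from lxml import etree

# Parse the response string into an element tree
html = etree.HTML(content)
# Select every text node inside <small class="author"> elements
authors = html.xpath("//small[@class='author']//text()")
print(authors)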

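Beautiful Soup is another Python library for parsing HTML and XML; its main job is to extract the data we need from web pages.

First install Beautiful Soup: pip install beautifulsoup4 (the "lxml" parser used below also requires the lxml package).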

from bs4 import BeautifulSoup

# Parse the response with the lxml parser
soup = BeautifulSoup(content, "lxml")
# CSS selector: <small> elements with class "author"
authors = soup.select('small.author')
for author in authors:
    print(author.get_text())
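CSS selectors can also be nested to keep related fields together. The following is a minimal sketch, assuming each quote sits in a div.quote that contains a span.text and a small.author; that markup and the sample_html string are assumptions for illustration. It uses Python's built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real response; the markup is assumed.
sample_html = """
<div class="quote">
  <span class="text">Quote one</span>
  <small class="author">Albert Einstein</small>
</div>
<div class="quote">
  <span class="text">Quote two</span>
  <small class="author">Jane Austen</small>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for quote in soup.select("div.quote"):
    # select_one returns the first match inside this quote block
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    pairs.append((author, text))

print(pairs)
```

Selecting the enclosing block first and then querying inside it avoids mismatching authors and quotes when some field is missing from one of the blocks.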

pyquery's syntax is almost identical to that of jQuery on the front end.

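from pyquery import PyQuery as pq

doc = pq(content)
# jQuery-style selector: <small> elements with class "author"
authors = doc('small.author')
# items() yields each match as its own PyQuery object
for author in authors.items():
    print(author.text())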

jsonpath is typically used when the response content is JSON data.

Syntax:

XPath                 | JSONPath                                | Result
----------------------|-----------------------------------------|-------------------------------------------------------------
/store/book/author    | $.store.book[*].author                  | the authors of all books in the store
//author              | $..author                               | all authors
/store/*              | $.store.*                               | all things in store, which are some books and a red bicycle
/store//price         | $.store..price                          | the price of everything in the store
//book[3]             | $..book[2]                              | the third book
//book[last()]        | $..book[(@.length-1)] or $..book[-1:]   | the last book in order
//book[position()<3]  | $..book[0,1] or $..book[:2]             | the first two books
//book[isbn]          | $..book[?(@.isbn)]                      | filter all books with an isbn number
//book[price<10]      | $..book[?(@.price<10)]                  | filter all books cheaper than 10
//*                   | $..*                                    | all elements in the XML document; all members of the JSON structure
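To make the ".." (recursive descent) operator in the table more concrete, here is a plain-Python sketch of what an expression such as $..author asks for. The find_key helper and the small data dict are hypothetical and written only for this illustration; this is not how the jsonpath library is implemented.

```python
def find_key(node, key):
    """Collect every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the key may appear again further down.
            found.extend(find_key(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


# Hypothetical data, shaped like a small store (assumed for this example).
data = {
    "store": {
        "book": [
            {"author": "Nigel Rees", "price": 8.95},
            {"author": "Evelyn Waugh", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

authors = find_key(data, "author")         # comparable to $..author
prices = find_key(data["store"], "price")  # comparable to $.store..price
# Comparable to the filter $..book[?(@.price<10)]
cheap = [book for book in data["store"]["book"] if book["price"] < 10]

print(authors)
print(prices)
print(cheap)
```

The point of the comparison is that one short JSONPath expression replaces a hand-written traversal like this, which is why the notation is convenient for deeply nested responses.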

Here we use a piece of JSON data.

Let's get all the authors and all the prices.

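import json

import jsonpath

# The JSON body was lost from the original; this is a reconstruction limited to
# the fields implied by the queries and the output shown (an assumption).
json_str = '''{"store": {"book": [
    {"author": "Nigel Rees", "price": 8.95},
    {"author": "Evelyn Waugh", "price": 12.99},
    {"author": "Herman Melville", "price": 8.99},
    {"author": "J. R. R. Tolkien", "price": 22.99}
], "bicycle": {"color": "red", "price": 19.95}}}'''

jc = json.loads(json_str)

jp = jsonpath.jsonpath(jc, '$..author')  # every author, at any depth
print(jp)

jp = jsonpath.jsonpath(jc, '$.store..price')  # every price under store
print(jp)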

Output:

['Nigel Rees', 'Evelyn Waugh', 'Herman Melville', 'J. R. R. Tolkien']

[8.95, 12.99, 8.99, 22.99, 19.95]
