Instantiate an etree object and load the source code of the page to be parsed into it.
Then call the etree object's xpath method together with an XPath expression to locate tags and capture their content.
pip install lxml
from lxml import etree
1. Load the source code of a local HTML file into an etree object
etree.parse(filepath)
2. Load source code fetched from the internet into the object
etree.HTML(page_text)
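As a minimal combined sketch (the local file name sample.html and the example URL are placeholders of my own, not from the original), the two loading paths and an XPath query look like this:
from lxml import etree
import requests

# 1. load a local HTML file; an HTMLParser is used because the file is HTML, not strict XML
local_tree = etree.parse('sample.html', etree.HTMLParser())

# 2. load HTML text fetched over the network
page_text = requests.get('https://example.com').text
remote_tree = etree.HTML(page_text)

# xpath() always returns a list of matches
print(remote_tree.xpath('//title/text()'))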
import requests
from lxml import etree
from fake_useragent import UserAgent
url = ''  # target listing URL elided in the original
headers = {'User-Agent': UserAgent().random}  # random User-Agent
response = requests.get(url=url, headers=headers)
page_text = response.text
tree = etree.HTML(page_text)
room_list = tree.xpath("//section[@class='list']/div/a/div[2]")
for room in room_list:
    # xpath() returns a list, so take the first match for each field
    title = room.xpath(".//div[@class='property-content-title']//h3/text()")[0]
    price = room.xpath(
        ".//div[@class='property-price']//span[@class='property-price-total-num']/text()")[0]
    avg = room.xpath(".//p[@class='property-price-average']/text()")[0]
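As a usage sketch of my own (not in the original post), the captured fields can be collected into a list of dicts for later processing; the key names here are arbitrary:
rooms = []
for room in room_list:
    rooms.append({
        'title': room.xpath(".//div[@class='property-content-title']//h3/text()")[0],
        'total_price': room.xpath(
            ".//div[@class='property-price']//span[@class='property-price-total-num']/text()")[0],
        'avg_price': room.xpath(".//p[@class='property-price-average']/text()")[0],
    })
print(len(rooms), 'listings parsed')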
from lxml import etree
import requests
from fake_useragent import UserAgent
import time
url = ''  # search URL elided in the original
ua = UserAgent()
# crawl the 彼岸圖網 wallpaper site
def pic_down(page, searchid):
    param = {'page': page, 'searchid': searchid}  # assumed: the original query params were elided
    headers = {'User-Agent': ua.random}  # random UA
    response = requests.get(url=url, headers=headers, params=param)
    if response.status_code != 200:
        print("當前狀態碼為: ", response.status_code)
        return False
    page_text = response.text
    # collect the detail-page link of every image on the current page
    index_etree = etree.HTML(page_text)
    index_list = index_etree.xpath("//ul[@class='clearfix']/li/a/@href")
    for picture_index_url in index_list:
        headers = {'User-Agent': ua.random}  # fresh random UA for each detail page
        pic_response = requests.get(url=picture_index_url, headers=headers)
        pic_etree = etree.HTML(pic_response.text)
        pic_link = '' + \
            pic_etree.xpath("//a[@id='img']/img/@src")[0]  # site prefix elided in the original
        fp.write(pic_link + '\n')  # fp is the global file handle opened in __main__
        print(pic_link)
    print('成功爬取第 {} 頁\n'.format(page))
    return True
if __name__ == '__main__':
    fp = open('鏈結.txt', 'w', encoding='utf-8')
    for i in range(0, 5):
        if pic_down(i, 16):
            time.sleep(3)  # pause between pages to be polite to the server
        else:
            print('爬取失敗')
            break
    fp.close()
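The script above only records the direct image URLs in 鏈結.txt. As a separate hedged sketch (not part of the original), the saved links could then be downloaded like this; the pictures directory and the numeric file names are my own assumptions:
import os
import requests
from fake_useragent import UserAgent

# assumption: read back the collected links and save each image locally
os.makedirs('pictures', exist_ok=True)  # hypothetical output directory
with open('鏈結.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        link = line.strip()
        if not link:
            continue
        img = requests.get(link, headers={'User-Agent': UserAgent().random})
        with open(os.path.join('pictures', '{}.jpg'.format(i)), 'wb') as out:
            out.write(img.content)  # image bytes, so write in binary mode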
import requests
from lxml import etree
from fake_useragent import UserAgent
ua = UserAgent()
url = ''
# "//div[@id='container']/div/a/@href"
headers = {'User-Agent': ua.random}  # random User-Agent
response = requests.get(url=url, headers=headers)
page_text = response.content
index_tree = etree.HTML(page_text)  # etree.HTML also accepts the raw bytes from response.content
index_link = index_tree.xpath("//div[@id='container']/div/a/@href")
for link in index_link:
    # "https://" + link is the full homepage link (domain prefix elided in the original)
    print("https://" + link)