First Steps in Web Scraping

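# Pay attention to how the regular expression is written.

import re

import requests

url = ''  # target URL, left empty in the original
headers = {}  # request headers; the values were omitted in the original

# Pass headers by keyword: the second positional argument of requests.get is params.
html = requests.get(url, headers=headers, timeout=10).text

# print(html)

# The markup around the capture group was lost from the original pattern; put the
# opening and closing tags of the target element on either side of (.*?).
redata = re.compile(r'(.*?)')

for i in re.findall(redata, html):
    print(i)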
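
To see why the note about writing the pattern carefully matters, here is a minimal sketch that runs against a hard-coded string rather than a downloaded page. The sample markup and the names SAMPLE_HTML, NON_GREEDY, and GREEDY are made up for illustration and do not come from the original script.

```python
import re

# Hypothetical fragment standing in for the text of a downloaded page.
SAMPLE_HTML = '<li>first entry</li><li>second entry</li>'

# (.*?) is non-greedy: each match stops at the nearest closing tag.
NON_GREEDY = re.compile(r'<li>(.*?)</li>')
# (.*) is greedy: one match runs on to the last closing tag on the line.
GREEDY = re.compile(r'<li>(.*)</li>')

non_greedy_matches = NON_GREEDY.findall(SAMPLE_HTML)
greedy_matches = GREEDY.findall(SAMPLE_HTML)

for text in non_greedy_matches:
    print(text)
print(greedy_matches)
```

The non-greedy pattern yields one result per list item, while the greedy one swallows everything between the first opening tag and the last closing tag, which is rarely what a scraper wants.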

# Querying with XPath. XPath is a language for selecting nodes
# in XML and HTML documents.
from lxml import html
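# The element tags were stripped from the original listing. They are reconstructed
# here from the XPath queries used below (bookstore, book, title, author, category);
# the year and price element names are assumed.
lx_html = '''<?xml version="1.0"?>
<bookstore>
<book>
  <title>everyday italian</title>
  <author>giada de laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>
<book category="children">
  <title>harry potter</title>
  <author>j k. rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
<book>
  <title>xquery kick start</title>
  <author>james mcgovern</author>
  <author>per bothner</author>
  <author>kurt cagle</author>
  <author>james linn</author>
  <author>vaidyanathan nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>
<book>
  <title>learning xml</title>
  <author>erik t. ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>
</bookstore>'''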
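# The workarounds I first found online all turned out to be cumbersome.
# Posts claiming that lxml no longer ships etree after Python 3.5 are misleading:
# the module is still there. If `from lxml import etree` is not resolved in your
# environment, etree can be reached through lxml.html instead:
etree = html.etree
htmldiv = etree.HTML(lx_html)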
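# print(htmldiv.xpath("//bookstore"))
# A node query like this returns a list of matching elements.
title = htmldiv.xpath("//bookstore")
print(title)
# Select the title of the first book
print(htmldiv.xpath("//bookstore/book[1]/title/text()"))
# Select the title of the last book
print(htmldiv.xpath("//bookstore/book[last()]/title/text()"))
# Select all authors of the second-to-last book
print(htmldiv.xpath("//bookstore/book[last()-1]/author/text()"))
# Select the titles of books whose category attribute equals "children"
# (the original path ended in a bare /text; title/text() is assumed here)
print(htmldiv.xpath('//book[@category="children"]/title/text()'))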
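
If lxml is not installed, the same positional and attribute predicates can be tried with the standard library's xml.etree.ElementTree, which supports only a subset of XPath. This is a swapped-in alternative, not the approach used above: there is no text() step, so the text is read from each matched element instead. The sample below is a trimmed, illustrative version of the bookstore data, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative sample: the element names and the "children" category follow the
# queries in this post; the rest of the markup is an assumption, and the author
# list of the third book is shortened.
SAMPLE_XML = """<bookstore>
  <book><title>everyday italian</title><author>giada de laurentiis</author></book>
  <book category="children"><title>harry potter</title><author>j k. rowling</author></book>
  <book><title>xquery kick start</title><author>james mcgovern</author><author>per bothner</author></book>
  <book><title>learning xml</title><author>erik t. ray</author></book>
</bookstore>"""

root = ET.fromstring(SAMPLE_XML)  # root is the <bookstore> element

# Paths are relative to root, so they start at book rather than //bookstore.
first_title = root.find("book[1]/title").text
last_title = root.find("book[last()]/title").text
# ElementTree has no text() step; read .text from each matched element.
penultimate_authors = [a.text for a in root.findall("book[last()-1]/author")]
children_titles = [t.text for t in root.findall("book[@category='children']/title")]

print(first_title)
print(last_title)
print(penultimate_authors)
print(children_titles)
```

For anything beyond these simple predicates, such as contains() or full axis support, lxml remains the better fit.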

That covers the basic process. I hope this post is helpful.

More updates will follow, so stay tuned.
