Python Web Scraping Example, Part 1

# -*- coding:utf-8 -*-
from lxml import etree
import urllib2
import random
import urlparse

# Configure the network proxy
proxy_info = {}  # proxy settings are elided in the source
proxy_support = urllib2.ProxyHandler()  # arguments are elided in the source
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

# List of User-Agent strings
user_agent_list = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)"
]
# Value elided in the source; reconstructed from the description below:
# pick one User-Agent at random from the list.
ua_header = {"User-Agent": random.choice(user_agent_list)}
url = ''  # target URL is elided in the source
request = urllib2.Request(url, headers=ua_header)
try:
    response = urllib2.urlopen(request, timeout=10)
    print(response.getcode())
    result = response.read()
    res_htm = etree.HTML(result)
    # Tab labels and tab links from the navigation bar
    tab_inner = res_htm.xpath("//div[@class='s_tab_inner']/*/text()")
    tab_href = res_htm.xpath("//div[@class='s_tab_inner']/*/@href")
    index = 0
    for inner in tab_inner:
        # The tab label being matched is redacted ('**') in the source
        if str(inner.encode('utf-8')) == '**':
            index = tab_inner.index(inner)
            index = int(index) - 1
    # Follow the link of the matched tab
    url_music = str(tab_href[index])
    request = urllib2.Request(url_music, headers=ua_header)
    response = urllib2.urlopen(request, timeout=60)
    print(response.getcode())
    result = response.read()
    res_htm = etree.HTML(result)
    # print etree.tostring(res_htm, encoding='utf-8')
    responsive = res_htm.xpath("//div[@id='responsive']//div[@class='search-info']//a/@href")
    # Read the "id" query parameter from the first result link
    column = urlparse.urlparse(responsive[0])
    param = urlparse.parse_qs(column.query)
    item_id = param['id'][0]
    print(item_id)
except Exception:
    print("Request failed or timed out, please try again later")

Code first (talk is cheap, show me the code); see above.

This example uses urllib2 to request the page, lxml to parse the page structure, and urlparse to parse the request parameters.
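As a rough illustration of the parameter-parsing step, here is a minimal sketch written for Python 3, where the urlparse module lives in urllib.parse. The helper name extract_query_param and the sample link are hypothetical and made up for this illustration; they do not come from the original listing.

```python
from urllib.parse import urlparse, parse_qs


def extract_query_param(link, name):
    """Return the first value of query parameter `name` in `link`, or None."""
    query = urlparse(link).query          # e.g. "id=12345&page=2"
    values = parse_qs(query).get(name)    # parse_qs maps each name to a list
    return values[0] if values else None


# Hypothetical link, used only for illustration.
sample_link = "http://example.com/search?id=12345&page=2"
sample_id = extract_query_param(sample_link, "id")
print(sample_id)
```

Returning None for a missing parameter avoids the KeyError that a direct dictionary lookup would raise when the link has no such parameter.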

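The first step in disguising a crawler is to supply a User-Agent. Sending the same User-Agent on every request carries a risk of having your IP banned, so here we pick one at random from a list. You can also bring in a package for this, such as fake_useragent.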
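The random pick can be sketched as follows. This is a minimal Python 3 version under stated assumptions: the names USER_AGENTS and random_headers are hypothetical, the list is shortened, and http://example.com/ is only a placeholder URL.

```python
import random
import urllib.request

# A shortened list of User-Agent strings; extend it as needed.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def random_headers(user_agents=USER_AGENTS):
    """Build a headers dict with one User-Agent chosen at random."""
    return {"User-Agent": random.choice(user_agents)}


headers = random_headers()
# Building the Request object sends nothing; the request only goes out
# when it is passed to urllib.request.urlopen().
request = urllib.request.Request("http://example.com/", headers=headers)
print(headers)
```

Calling random_headers() once per request, rather than once per run, varies the User-Agent between requests.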

Before using the lxml package to look up page elements, you need to know some XPath syntax.
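To get a feel for the syntax, here is a small sketch that runs without extra packages. Note the substitution: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, rather than lxml. With lxml, the full expressions from the listing, ending in /text() and /@href, can be passed to xpath() directly. The sample markup is hypothetical and deliberately well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
SAMPLE = """
<html><body>
<div class="s_tab_inner">
<b>Web</b>
<a href="http://example.com/news">News</a>
<a href="http://example.com/music">Music</a>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# ".//div[@class='s_tab_inner']/*" selects every direct child element
# of any div whose class attribute equals 's_tab_inner'.
children = root.findall(".//div[@class='s_tab_inner']/*")

# ElementTree has no /text() or /@href steps, so read them per element.
labels = [child.text for child in children]
links = [child.get("href") for child in children if child.get("href")]

# The first tab has no href, so links is one item shorter than labels.
music_link = links[labels.index("Music") - 1]
print(labels, links, music_link)
```

In this sample the first child carries no href, so the list of links is one entry shorter than the list of labels. A mismatch of that kind may be why the listing subtracts one from the index, though the original post does not say so.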
