Study Notes: Developing a Simple Crawler in Python

1. Introduction to crawlers

A crawler is a program that automatically fetches information from the Internet.

2. Simple crawler architecture

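3. URL manager

URL manager: manages the set of URLs waiting to be crawled and the set of URLs that have already been crawled.

---- This prevents both duplicate crawling and crawl loops.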
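The two sets described above can be sketched in memory as follows. This is a minimal illustration written for Python 3; the class name `UrlManager` and its method names are assumptions chosen for this example, not taken from the original notes.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        """Queue a URL unless it is empty or already known."""
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue several URLs at once."""
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        """Return True while there is still something to crawl."""
        return bool(self.new_urls)

    def get_new_url(self):
        """Take one pending URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["http://example.com/a", "http://example.com/a"])
first = manager.get_new_url()
manager.add_new_url(first)  # ignored: this URL was already crawled
print(manager.has_new_url())
```

Because a crawled URL is moved into `old_urls` and never re-queued, a page that links back to itself or to an earlier page cannot cause a loop.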

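4. Web page downloader (urllib2)

urllib2: Python's official basic module (part of the Python 2 standard library)

requests: a more powerful third-party package, recommended for later use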

import urllib2
# Request the URL directly
response = urllib2.urlopen('')
# Get the status code; 200 means the request succeeded
print response.getcode()
# Read the content
cont = response.read()

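Adding a header so that the request looks like it comes from a browser guards against sites that detect crawlers and ban their IP addresses.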

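import cookielib
import urllib
import urllib2
# Create a Request object (url is the page to fetch)
request = urllib2.Request(url)
# Add form data; a request that carries data is sent as a POST
request.add_data(urllib.urlencode({'a': '1'}))
# Add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')
# Send the request and get the result
response = urllib2.urlopen(request)
# Create a cookie container
cj = cookielib.CookieJar()
# Create an opener that handles cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Install the opener into urllib2
urllib2.install_opener(opener)
# Fetch the page with the cookie-aware urllib2
response = urllib2.urlopen("/")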
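For readers on Python 3, where `urllib2` became `urllib.request` and `cookielib` became `http.cookiejar`, the same two ideas (a request with data and a browser-like header, and a cookie-aware opener) might look like the sketch below. The helper names `build_request` and `build_cookie_opener` and the placeholder address `http://example.com/` are assumptions made for illustration; nothing here touches the network.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_request(url, params, user_agent="Mozilla/5.0"):
    """Return a Request carrying form data and a browser-like User-Agent."""
    data = urllib.parse.urlencode(params).encode("utf-8")
    request = urllib.request.Request(url, data=data)
    request.add_header("User-Agent", user_agent)
    return request


def build_cookie_opener():
    """Return (opener, jar); the opener stores received cookies in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


# Placeholder URL, used only to show what the request contains.
request = build_request("http://example.com/", {"a": "1"})
print(request.get_method())  # a request with data attached is sent as POST
print(request.data)
print(request.get_header("User-agent"))

opener, jar = build_cookie_opener()
print(len(jar))  # the jar starts out empty
```

Passing the request to `opener.open(request)` would then send it, and any cookies set by the server would be collected in `jar`.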

urllib2 example

import cookielib
import urllib2
url = ""
print 'Method 1'
response1 = urllib2.urlopen(url)
print response1.getcode()  # status code; 200 means success
print len(response1.read())
print 'Method 2'
request = urllib2.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")  # pose as a browser
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())
print 'Method 3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
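In a crawler, the download step is usually wrapped in one function that returns nothing useful when the fetch fails, so the main loop can simply skip that page. The following is a rough Python 3 sketch of that idea; the function name `download`, its `opener` parameter, and the 10-second timeout are assumptions for this example rather than part of the original notes.

```python
import urllib.request


def download(url, opener=None, timeout=10):
    """Return the page body as bytes, or None if the fetch fails."""
    if opener is None:
        opener = urllib.request.build_opener()
    try:
        request = urllib.request.Request(
            url, headers={"User-Agent": "Mozilla/5.0"}
        )
        with opener.open(request, timeout=timeout) as response:
            if response.getcode() != 200:
                return None
            return response.read()
    except (ValueError, OSError):
        # ValueError: malformed URL; OSError covers urllib.error.URLError,
        # HTTP error statuses, and socket timeouts.
        return None


# A malformed URL is rejected before any network access happens.
print(download("not a url"))
```

Accepting an `opener` argument makes it possible to pass in a cookie-aware opener, or a stand-in object when testing without a network connection.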

5. Web page parser (BeautifulSoup)

A web page parser is a tool that extracts valuable data from web pages.

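Types of Python web page parsers

Fuzzy string matching: regular expressions

Structured parsing: html.parser, Beautiful Soup, lxml

Structured parsing works on the page as a DOM (Document Object Model) tree.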
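To make the difference between the two families concrete, here is a small Python 3 sketch that extracts links from the same snippet twice: once with a regular expression, and once with the standard library's `html.parser`. Note that `html.parser` reports tags as a stream of events rather than building a full DOM tree, so it is only a lightweight stand-in for structured parsing. The class name `LinkCollector` and the sample HTML string are made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every a tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the a tag currently open, if any
        self._text = []    # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


html = (
    '<p>See <a href="/view/123.htm">Python</a> and '
    '<a href="/view/456.htm">Crawler</a>.</p>'
)

# Fuzzy matching: treats the page as plain text.
fuzzy = re.findall(r'href="([^"]+)"', html)
print(fuzzy)  # ['/view/123.htm', '/view/456.htm']

# Structured parsing: follows the tag structure and keeps the link text.
collector = LinkCollector()
collector.feed(html)
print(collector.links)
# [('/view/123.htm', 'Python'), ('/view/456.htm', 'Crawler')]
```

The regular expression is shorter, but it knows nothing about which tag an attribute belongs to; the structured version can tie each address to its own link text.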

We usually use BeautifulSoup as the web page parser.

The basic syntax is:

from bs4 import BeautifulSoup
# Create a BeautifulSoup object from an HTML string
soup = BeautifulSoup(
    html_doc,              # HTML document string
    'html.parser',         # HTML parser
    from_encoding='utf8'   # encoding of the HTML document
)

Searching for nodes (find_all, find)

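# Method: find_all(name, attrs, string)
# Find all nodes whose tag is a
soup.find_all('a')
# Find all nodes whose tag is a and whose link has the form /view/123.htm
soup.find_all('a', href='/view/123.htm')
# Find all nodes whose tag is div, whose class is abc, and whose text is python
soup.find_all('div', class_='abc', string='python')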

Accessing node information

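# Assume node is a found a tag whose link text is Python
# Get the tag name of the found node
node.name
# Get the href attribute of the found a node
node['href']
# Get the link text of the found a node
node.get_text()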

BeautifulSoup example test

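# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
# The "three sisters" sample document from the Beautiful Soup documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
print 'Get all links'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()
print 'Get the link for Lacie'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()
print 'Regular expression match'
link_node = soup.find('a', href=re.compile(r"ill"))
print link_node.name, link_node['href'], link_node.get_text()
print 'Get the text of the p paragraph'
p_node = soup.find('p', class_="title")
print p_node.name, p_node.get_text()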

6. Complete example

Analyze the target: URL format, data format, page encoding

Write the code

Run the crawler
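The pieces from the earlier sections (URL sets, a downloader, a parser) come together in a main loop. The sketch below, written for Python 3, is one possible shape for that loop and not the author's complete code. The function names `parse` and `crawl`, the regular expressions, and the in-memory `FAKE_SITE` dictionary that stands in for real downloads are all assumptions made so the example runs without a network connection.

```python
import re
from urllib.parse import urljoin


def parse(page_url, html):
    """Return (new_urls, data) extracted from one page."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    new_urls = {urljoin(page_url, href) for href in hrefs}
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    data = {
        "url": page_url,
        "title": title.group(1).strip() if title else None,
    }
    return new_urls, data


def crawl(root_url, download, max_pages=10):
    """Crawl from root_url; download(url) returns HTML text or None."""
    new_urls, old_urls, results = {root_url}, set(), []
    # max_pages limits attempted pages, including failed downloads.
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        html = download(url)
        if html is None:
            continue  # download failed; move on to the next URL
        found, data = parse(url, html)
        new_urls |= found - old_urls  # never re-queue a visited URL
        results.append(data)
    return results


# Two fake pages that link to each other, plus one dead link.
FAKE_SITE = {
    "http://example.com/": '<title>Home</title><a href="/a">A</a>',
    "http://example.com/a": (
        '<title>Page A</title><a href="/">Home</a><a href="/missing">x</a>'
    ),
}

results = crawl("http://example.com/", FAKE_SITE.get)
print([page["title"] for page in results])  # ['Home', 'Page A']
```

In a real crawler, `download` would fetch the page over HTTP and `parse` would typically use BeautifulSoup; analyzing the target's URL format, data format, and page encoding determines what those two functions need to look for.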

For the complete code, see
