Web Scraping Basics (1): Scraping Static HTML Pages with Python

System environment:

Operating system: Windows 10 Pro 64-bit. Python: Anaconda2, Python 2.7

Python packages: requests, BeautifulSoup (bs4), os

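Beginners usually start scraping with static HTML pages, because they are not hard to scrape and are easy to get going with. When you run into a function you have not seen before, look it up (Baidu works fine), work out what it does and what each argument is for, and use it a few times; after that you will have a rough feel for how it behaves (that is how I got started as a beginner). Enough talk, here is the code: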

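# -*- coding: utf-8 -*-
"""
Created on Thu Apr 26 18:09:20 2018
@author: zww
"""
import os

import requests
from bs4 import BeautifulSoup

proxies = {}  # proxy settings are omitted in the source; fill in your own
hd = {}       # request headers are omitted in the source; fill in your own
url = ''      # start URL is omitted in the source; fill in your own
req = requests.get(url, headers=hd, proxies=proxies)
# print req
bs = BeautifulSoup(req.content, 'html.parser')
div_all = bs.find_all('div', attrs={})  # find the div tags (the attrs filter is omitted in the source)
for infos in div_all:
    a_all = infos.find_all('a')  # find all a tags inside the div
    for a in a_all[0:10]:  # only use the first 10 a tags
        ind_code = a.get('href')[-7:-1]  # slice the industry code out of the href
        ind_name = a.text  # text of the a tag, i.e. the industry name
        # print ind_name
        folder = "e:\\test\\" + ind_name
        if not os.path.isdir(folder):
            os.makedirs(folder)  # create e:\test and, inside it, a folder named after the industry
        for i in range(1, 4):  # loop 3 times, i.e. scrape only the first 3 pages of industry news
            url1 = '' + ind_code + '/index_' + str(i) + '.shtml'  # URL prefix is omitted in the source
            req1 = requests.get(url1, headers=hd, proxies=proxies)
            # print req.text
            bs1 = BeautifulSoup(req1.content, 'html.parser')
            span_all = bs1.find_all('span', attrs={})  # find the span tags (attrs filter omitted in the source)
            for span in span_all:
                a_all = span.find_all('a', attrs={})  # find the a tags inside the span (attrs filter omitted in the source)
                news_link = span.a.get('href')  # news link from the first a tag in the span
                req2 = requests.get(news_link, headers=hd, proxies=proxies)  # request the news page
                # print req2
                bs2 = BeautifulSoup(req2.content, 'html.parser')
                try:
                    news_title = bs2.find('h2').text  # news title
                except AttributeError:  # no h2 on the page: skip this item and move on to the next
                    continue
                p_all = bs2.find_all('p')  # find all p tags
                # print news_title
                try:  # open a text file named after the news title inside the industry folder, for writing
                    path1 = folder + "\\" + news_title + ".txt"
                    fo = open(path1, "w")
                except (IOError, OSError):  # the title contains characters not allowed in a file name: skip this item
                    continue
                for p in p_all:
                    news = p.text.encode('utf-8')  # news content held in the p tag
                    fo.write(news)  # write the news content to the text file
                fo.close()  # close the file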
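
The slice `[-7:-1]` and the way `url1` is assembled are easier to follow on a concrete string. The sketch below uses a made-up href and a made-up base URL (both are assumptions, since the real addresses are omitted in the listing), and it assumes the href ends with a six-character industry code followed by a trailing slash.

```python
# Hypothetical href; the real link format is not shown in the listing.
href = 'http://example.com/industry/480100/'

# [-7:-1] drops the final character (the trailing slash)
# and keeps the six characters in front of it.
ind_code = href[-7:-1]

# Hypothetical base URL standing in for the omitted prefix of url1.
base = 'http://example.com/news/'

# Same pattern as url1: code + '/index_' + page number + '.shtml',
# for pages 1 to 3.
page_urls = [base + ind_code + '/index_' + str(i) + '.shtml'
             for i in range(1, 4)]
```

If the real hrefs do not end with a slash, or the code is not six characters long, the slice bounds would need adjusting.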

The code is commented in some detail. If anything is missing or wrong, please point it out.
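
One gap worth noting: the script skips any article whose title cannot be used as a file name. A possible alternative, sketched here under the assumption that the files are written on Windows, is to replace the forbidden characters before building the path. `safe_filename` is a hypothetical helper and is not part of the original script.

```python
import re


def safe_filename(title, replacement='_'):
    """Return a version of title that Windows accepts as a file name."""
    # Characters Windows reserves in file names: \ / : * ? " < > |
    # Line breaks and tabs are replaced as well.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title)
    # Windows also rejects names that end with a space or a dot.
    cleaned = cleaned.strip().rstrip('.')
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or 'untitled'
```

With this in place, the path could be built from `safe_filename(news_title)` instead of `news_title`, so fewer articles would be dropped.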
