The Complete (wei) Guide to Python Web Scraping

Complete scraper code

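I really don't feel like writing my midterm BP assignment... so while the code I just finished is still fresh in my mind, I might as well get this guide started. It focuses on scraping text content, which is a little more work than other kinds of scraping: you may need a firmer grasp of HTML and CSS, and you need to know how to store data from Python.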

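This guide assumes some programming experience and basic HTML and CSS. I will explain things as we go, but... I worry that I may skip some of the more basic explanations and lose you along the way. If you already know one programming language and the basics of HTML and CSS, you can skip part one.

First, you need to understand what a web page really is. The page you see usually looks like this ↓ ↓ ↓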

But what the server actually sends to the browser looks like this ↓ ↓ ↓

<a title="頻道首頁" href="" class="img blog-icon"></a>
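To make the point concrete, here is a minimal sketch of how a program reads that kind of markup. It uses only the standard library's `html.parser`; the `LinkCollector` class, the sample string, and its `/index` href are made up for illustration and are not taken from any real page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the attributes of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self.links.append(dict(attrs))


# Hypothetical markup modeled on the fragment above
sample_html = '<div><a title="頻道首頁" href="/index" class="img blog-icon"></a></div>'

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

A scraper does essentially this at a larger scale: it ignores how the page is rendered and works on the tags and attributes directly.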

The scraping flow is roughly as follows ↓ ↓ ↓
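In text form, the flow that the full code in this post follows is: fetch a listing page, pull out the link to each detail page, fetch each detail page, extract the fields, and collect the records. The sketch below walks through those steps against a tiny in-memory "site" so it runs without a network. `FAKE_SITE`, `fetch`, `extract_detail_links`, `extract_title`, and `crawl` are all hypothetical names, and the markup is invented.

```python
import re

# Hypothetical in-memory "site": URL -> HTML. A real run would fetch over HTTP.
FAKE_SITE = {
    "list?page=1": (
        '<dl><a class="t" href="job/1">A</a></dl>'
        '<dl><a class="t" href="job/2">B</a></dl>'
    ),
    "job/1": '<span class="pos_title">Engineer</span>',
    "job/2": '<span class="pos_title">Designer</span>',
}


def fetch(url):
    """Stand-in for an HTTP GET: return the HTML stored for this URL."""
    return FAKE_SITE[url]


def extract_detail_links(list_html):
    """Step 2: find the detail-page link of every listing entry."""
    return re.findall(r'class="t" href="(.*?)"', list_html)


def extract_title(detail_html):
    """Step 4: pull one field out of a detail page (None if it is missing)."""
    m = re.search(r'class="pos_title">(.*?)<', detail_html)
    return m.group(1) if m else None


def crawl(list_url):
    """Steps 1-5: listing page -> detail links -> detail pages -> records."""
    records = []
    for link in extract_detail_links(fetch(list_url)):
        records.append({"url": link, "title": extract_title(fetch(link))})
    return records


records = crawl("list?page=1")
print(records)
```

In a real scraper the last step would write each record to a database instead of a list.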

That's it for today; I'll continue tomorrow.

The code is not trimmed down yet; I'll remove some unnecessary parts later.

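import os
import random
import re
import time
from urllib import request

import pymysql
import pymysql.cursors
import requests
from bs4 import BeautifulSoup


class BeautifulPicture:

    def __init__(self):
        # Class initialization
        self.web_url = ''  # URL of the page to visit (stripped in the source)

    def get_pic(self, p):
        global count
        # The URL template was stripped in the source; it needs one
        # placeholder for the page number p before this line can run.
        web_url = '' % (p)
        # Connection settings for pymysql.connect() were stripped in the source.
        config = {}
        print('Sending GET request for the listing page')
        r = self.request(web_url)
        print('Collecting all listing entries')
        # Find every <dl> tag whose __addition attribute is "0"
        all_a = BeautifulSoup(r.text, 'lxml').find_all('dl', __addition='0')
        i = 0
        # For each entry: extract the detail URL, request it, then store the record
        for a in all_a:
            img_str = str(a)  # full markup of the entry, as a string for re.search
            # The tail of every pattern below was lost in the source. The
            # terminators used here ('>', '<', '</') are assumptions, and the
            # slice offsets are tuned to the target page's markup.
            matchobj = re.search(r'class="t".*?>', img_str)
            if not matchobj:
                continue
            matchobj2 = re.search(r'f=".*?"', matchobj.group())
            if not matchobj2:
                continue
            # print('  Sending GET request')
            # print(matchobj2.group()[3:-1])
            r2 = self.request(matchobj2.group()[3:-1])  # drop f=" and the closing quote
            matchobj_name = re.search(r'pos_name">.*?<', r2.text)
            if matchobj_name:
                name = matchobj_name.group()[11:-1]
            else:
                name = "空"  # stored value meaning "empty"
                print("    empty")
            matchobj_title = re.search(r'pos_title">.*?<', r2.text)
            if matchobj_title:
                title = matchobj_title.group()[12:-1]
            else:
                print("    job category not found")
                break  # give up on the rest of this page
            matchobj_salary = re.search(r'pos_salary">.*?<', r2.text)
            if matchobj_salary:
                salary = matchobj_salary.group()[12:-1]
            else:
                salary = "面議"  # stored value meaning "negotiable"
            matchobj_des = re.search(r'class="posdes".*?</', r2.text)
            # print(matchobj_des.group())
            description = matchobj_des.group()[32:-2] if matchobj_des else ''
            connection = pymysql.connect(**config)
            try:
                with connection.cursor() as cursor:
                    # Run the SQL statement to insert the record
                    sql = 'insert into irm values (%s, %s, %s, %s,"58")'
                    cursor.execute(sql, (name, title, salary, description))
                # Autocommit is not enabled, so commit explicitly to persist the insert
                connection.commit()
            finally:
                connection.close()
            i += 1
            count += 1
            print("  record %d this round, %d records in total" % (i, count))
            if i % 20 == 0:
                print("*****=pausing 5 seconds*****=")
                time.sleep(5)

    def request(self, url):
        # Return the response for the page
        headers = {}  # header values were stripped in the source
        r = requests.get(url, headers=headers, timeout=10)
        return r

    def proxy_test(self, url):
        try:
            # The target URL and the proxies mapping were stripped in the source;
            # passing `url` as the HTTP proxy is an assumption.
            requests.get("", proxies={"http": url}, timeout=3)
        except requests.RequestException:
            i = 0
        else:
            i = 1
        return i

    def get_proxy_ip(self):
        headers = {}  # header values were stripped in the source
        req = request.Request(r'', headers=headers)  # proxy-list URL was stripped
        response = request.urlopen(req)
        html = response.read().decode('utf-8')
        proxy_list = []
        ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
        # The <td> tags in the next two patterns are an assumption; the source
        # shows only r'\d+' and r'|', which looks like stripped markup.
        port_list = re.findall(r'<td>\d+</td>', html)
        for i in range(len(ip_list)):
            ip = ip_list[i]
            port = re.sub(r'<td>|</td>', '', port_list[i])
            proxy = '%s:%s' % (ip, port)
            proxy_list.append(proxy)
        return proxy_list


beauty = BeautifulPicture()  # create an instance of the class
count = 0
for i in range(500, 600):
    print("*****=page %d*****=" % i)
    time.sleep(2)
    beauty.get_pic(i)  # run the method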
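The fixed slice offsets used above (such as `[11:-1]`) break as soon as the page markup shifts by a single character. One possible alternative, sketched below, is to let a regex capture group pick out the value and to fall back to a default when the field is missing. The storage step uses the standard library's `sqlite3` as a stand-in for MySQL so the example runs anywhere; note that `sqlite3` uses `?` placeholders where pymysql uses `%s`. The `extract` helper, the sample markup, and the column names of `irm` are all assumptions made for illustration.

```python
import re
import sqlite3


def extract(pattern, html, default=None):
    """Return the first capture group of pattern, stripped, or default."""
    m = re.search(pattern, html, re.S)
    return m.group(1).strip() if m else default


# Hypothetical detail page; note that it has no pos_salary field
detail_html = (
    '<span class="pos_name">Acme</span>'
    '<span class="pos_title"> Driver </span>'
    '<div class="posdes">Day shift</div>'
)

record = (
    extract(r'pos_name">(.*?)<', detail_html, default="空"),
    extract(r'pos_title">(.*?)<', detail_html),
    extract(r'pos_salary">(.*?)<', detail_html, default="面議"),
    extract(r'class="posdes">(.*?)<', detail_html, default=""),
)

# In-memory database standing in for the MySQL table; column names are assumed
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE irm (name TEXT, title TEXT, salary TEXT, description TEXT, source TEXT)"
)
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO irm VALUES (?, ?, ?, ?, '58')", record)
rows = conn.execute("SELECT * FROM irm").fetchall()
conn.close()
print(rows)
```

Passing the values as a separate tuple, as the original `cursor.execute(sql, (...))` call also does, keeps scraped text from being interpreted as SQL.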
