This crawler mainly uses the requests, json, and bs4 (BeautifulSoup) modules. Suggestions for improvement are very welcome! :)
Source:
# -*- coding:utf-8 -*-
import requests, json, time
from bs4 import BeautifulSoup


class tencent_hr(object):
    def __init__(self):
        # Base URL of the job-listing page (left blank in the original post)
        self.base_url = ""
        # Request headers; the original value was lost, so a minimal
        # placeholder User-Agent is used here
        self.headers = {"User-Agent": "Mozilla/5.0"}
        # Collected job records
        self.item_list = []
        self.page = 0

    # Send the request
    def send_request(self, url, params={}):
        time.sleep(2)
        try:
            response = requests.get(url, params=params, headers=self.headers)
            return response.content
        except Exception as e:
            print(e)
    # Parse the data
    def parse_data(self, data):
        # Initialize the parser
        bs = BeautifulSoup(data, 'lxml')
        # Grab the job rows -- select() returns a list
        data_list = bs.select('.even, .odd')
        # Extract the fields from each row
        for data in data_list:
            data_dict = {}
            data_dict['work_name'] = data.select('td a')[0].get_text()
            data_dict['work_type'] = data.select('td')[1].get_text()
            data_dict['work_count'] = data.select('td')[2].get_text()
            data_dict['work_place'] = data.select('td')[3].get_text()
            data_dict['work_time'] = data.select('td')[4].get_text()
            # Append each record dict to the list
            self.item_list.append(data_dict)
        # Check whether this is the last page: on the last page the
        # "next" link carries a noactive class
        next_label = bs.select('#next')
        # Read the tag's class attribute -- returns a list of class
        # names, or None if the attribute is absent
        judge = next_label[0].get('class')
        return judge
    # Write the results to a file
    def write_file(self):
        # Serialize the list to a JSON string
        data_str = json.dumps(self.item_list)
        with open('04tencent_hr.json', 'w') as f:
            f.write(data_str)
    # Drive the crawl
    def run(self):
        while True:
            # Build the query parameters (the original dict was lost;
            # the listing is assumed to paginate on a start offset)
            params = {"start": self.page}
            # Send the request
            data = self.send_request(self.base_url, params=params)
            # Parse the data
            judge = self.parse_data(data)
            self.page += 10
            print(self.page)
            # On the last page the noactive class appears, so break
            if judge:
                break
        self.write_file()


if __name__ == '__main__':
    spider = tencent_hr()
    spider.run()
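As a self-contained sanity check of the selector logic above, the sketch below runs the same `.even, .odd` row selection and the `#next` / `noactive` check against a small hand-written HTML fragment. The table structure and all field values are assumptions made up for illustration, not captured from the real site, and it uses the built-in `html.parser` so no lxml install is required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the assumed layout of the listing table
html = """
<table>
  <tr class="even">
    <td><a href="#">Backend Engineer</a></td>
    <td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-01-01</td>
  </tr>
  <tr class="odd">
    <td><a href="#">Data Analyst</a></td>
    <td>Data</td><td>1</td><td>Beijing</td><td>2018-01-02</td>
  </tr>
</table>
<a id="next" class="noactive">next</a>
"""

bs = BeautifulSoup(html, 'html.parser')

# One selector matches both alternating row classes
rows = bs.select('.even, .odd')
items = []
for row in rows:
    items.append({
        'work_name': row.select('td a')[0].get_text(),
        'work_type': row.select('td')[1].get_text(),
        'work_count': row.select('td')[2].get_text(),
        'work_place': row.select('td')[3].get_text(),
        'work_time': row.select('td')[4].get_text(),
    })

# On the "last page", the next link carries the noactive class
judge = bs.select('#next')[0].get('class')
print(items[0]['work_name'])  # Backend Engineer
print(judge)                  # ['noactive']
```

If `#next` had no class attribute (any page before the last), `get('class')` would return None, which is why the `if judge: break` check in `run()` works as an end-of-pagination test.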
首先,python中自帶urllib及urllib2這兩個模組,基本上能滿足一般的頁面抓取,另外,requests 也是非常有用的。對於帶有查詢欄位的url,get請求一般會將來請求的資料附在url之後,以?分割url和傳輸資料,多個引數用 連線。data requests data為dict,js...