Scraping Lagou job listings and saving them as a JSON file

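from bs4 import BeautifulSoup
import requests
import re
import pymongo
import json

# Connect to a local MongoDB instance (only needed for the optional insert below).
client = pymongo.MongoClient('localhost', 27017)
lagou = client['lagou']
sheetlagou = lagou['sheetlagou']

url = ''  # the target URL is left blank in the original post
headers = {}  # the header values are missing from the original post; fill in your own
wb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')
areas = soup.select('span.add em')
salarys = soup.select('span.money')
expanddegrees = soup.select('div.p_bot div.li_b_l')

# Delimiters around the experience and degree fields in the listing text.
w1 = '經驗'
w2 = '/'
w3 = '  '  # two spaces

with open('data.json', 'w', encoding='utf-8') as fileobject:
    for area, salary, expanddegree in zip(areas, salarys, expanddegrees):
        text = expanddegree.get_text()
        expe = re.compile(w1 + '(.*?)' + w2, re.S).findall(text)[0].strip()
        degree = re.compile(w2 + '(.*?)' + w3, re.S).findall(text)[0].strip()
        # The dict body is missing from the original post; these keys are an assumption.
        data = {'area': area.get_text(), 'salary': salary.get_text(), 'expe': expe, 'degree': degree}
        print(data)
        jsondata = json.dumps(data, ensure_ascii=False)
        fileobject.write(jsondata + '\n')  # one JSON object per line
        # sheetlagou.insert_one(dict(data))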
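The parsing and saving steps can be tried offline before pointing the script at a live page. The following is a minimal sketch, not the post's exact code: the sample text, the helper names (parse_requirements, write_records, read_records) and the temporary file location are all assumptions made for illustration. It writes one JSON object per line, because several json.dumps results written back to back with no separator cannot be parsed again as a single JSON document.

```python
import json
import os
import re
import tempfile

# Same delimiters as in the script above.
W1 = "經驗"
W2 = "/"
W3 = " " * 2  # two spaces

# Hypothetical sample of the text inside one div.li_b_l element.
SAMPLE_TEXT = "經驗3-5年 " + W2 + " 本科" + W3 + "全職"

EXPERIENCE_RE = re.compile(W1 + "(.*?)" + W2, re.S)
DEGREE_RE = re.compile(W2 + "(.*?)" + W3, re.S)


def parse_requirements(text):
    """Return (experience, degree) extracted from one listing's text."""
    experience = EXPERIENCE_RE.findall(text)[0].strip()
    degree = DEGREE_RE.findall(text)[0].strip()
    return experience, degree


def write_records(path, records):
    """Write one JSON object per line so the file can be parsed back."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Read the file back, one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


experience, degree = parse_requirements(SAMPLE_TEXT)
records = [{"expe": experience, "degree": degree}]

# Write to a temporary directory so the demo does not touch the working directory.
path = os.path.join(tempfile.mkdtemp(), "data.json")
write_records(path, records)
loaded = read_records(path)
print(loaded)
```

If the real page text uses different delimiters, findall returns an empty list and the [0] lookup raises IndexError, so it is worth checking the patterns against a saved copy of the page first.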

A few lessons learned

1. If the page is loaded dynamically, remember to send request headers; without them the scrape returns an empty list.
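The post does not show what the header dictionary contains, so here is a rough sketch of what one might look like. The header values and the URL are placeholders, not the ones the author used; in practice you would copy them from the browser's network panel. The sketch uses urllib.request from the standard library so that it runs without extra packages and without touching the network; with requests, the same dictionary is passed as headers=headers.

```python
import urllib.request

# Hypothetical values: copy the real ones from your browser's network panel.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Referer": "https://www.example.com/",
}

# Placeholder URL; the original post leaves the real one blank.
url = "https://www.example.com/jobs"

# Building the request object does not send anything yet.
request = urllib.request.Request(url, headers=headers)

# urllib stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))

# To actually fetch the page:
# with urllib.request.urlopen(request, timeout=10) as response:
#     html = response.read().decode("utf-8")
```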

2. Part of the code writes data to the MongoDB database (the sheetlagou.insert_one line); comment it out or restore it as needed.
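Instead of commenting the database line in and out by hand, one option is to make the collection an optional argument. This is only a sketch of that idea: save_record and FakeCollection are hypothetical names, the sample record is made up, and FakeCollection stands in for a real pymongo collection so the example runs without a MongoDB server. One thing to watch for: pymongo's insert_one adds an _id key to the dict it is given, and json.dumps cannot serialize that value, so the sketch passes a copy.

```python
import io
import json


def save_record(record, fileobject, collection=None):
    """Write record as one JSON line; also insert it if a collection is given."""
    fileobject.write(json.dumps(record, ensure_ascii=False) + "\n")
    if collection is not None:
        # Pass a copy so the caller's dict is left unchanged by the insert.
        collection.insert_one(dict(record))


class FakeCollection:
    """Stand-in for a pymongo collection; it only records what it receives."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


buffer = io.StringIO()
fake = FakeCollection()
record = {"area": "北京", "salary": "10k-20k"}  # made-up sample values

save_record(record, buffer)        # file only
save_record(record, buffer, fake)  # file and database
print(len(fake.documents))
```

With a real database, the collection from the script (sheetlagou) would be passed in place of the fake one.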

