Scraping Lagou Job Postings with a Python Crawler

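I previously wrote an article, "Lagou job-information crawler practice" (拉勾網職位資訊爬蟲練習), about scraping the positions returned by a Lagou search for "data analysis". I recently joined a design-focused company, so I wanted to put together a data analysis report on "design" positions. Running the original code as-is no longer returned any data, so I modified it slightly. This post mainly records the crawler code.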

# Import the libraries used
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import time
from datetime import datetime

# Fetch the job requirements from a position's detail page
def getjobneeds(positionid):
    url = ''  # fill in the detail-page URL template; it receives the position id via format()
    headers = {}  # fill in your request headers
    s = requests.Session()
    s.get(url.format(positionid), headers=headers, timeout=3)  # first request, made only to obtain cookies
    cookie = s.cookies  # cookies obtained from that request
    response = s.get(url.format(positionid), headers=headers, cookies=cookie, timeout=3)  # fetch the page text
    time.sleep(5)  # take a short break
    soup = BeautifulSoup(response.text, 'html.parser')
    need = ' '.join([p.text.strip() for p in soup.select('.job_bt div')])
    return need

# Extract the details of a single position
def getjobdetails(jd):
    fields = [
        'businessZones', 'companyFullName', 'companyLabelList', 'financeStage',
        'skillLables', 'companySize', 'latitude', 'longitude', 'city', 'district',
        'salary', 'secondType', 'workYear', 'education', 'firstType', 'thirdType',
        'positionName', 'positionLables', 'positionAdvantage',
    ]
    results = {key: jd[key] for key in fields}
    positionid = jd['positionId']
    results['need'] = getjobneeds(positionid)
    time.sleep(2)  # pause to control the request rate
    print(jd, 'get')
    return results

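# Fetch the job listings from every results page
def parselistlinks(url_start, url_parse):
    jobs = []
    from_data = {}  # fill in the search form fields; 'pn' (page number) is set in the loop
    headers = {}  # fill in your request headers
    res = []
    for n in range(30):
        from_data['pn'] = n + 1
        s = requests.Session()
        s.get(url_start, headers=headers, timeout=3)  # request the search page to obtain cookies
        cookie = s.cookies  # cookies obtained from that request
        response = s.post(url_parse, data=from_data, headers=headers, cookies=cookie, timeout=3)  # fetch the response text
        time.sleep(5)
        res.append(response)
    jd = []
    for m in range(len(res)):
        jd.append(json.loads(res[m].text)['content']['positionResult']['result'])
    for j in range(len(jd)):
        for i in range(15):
            jobs.append(getjobdetails(jd[j][i]))
    time.sleep(30)
    return jobs
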
def main():
    url_start = "設計?city=%e6%88%90%e9%83%bd&cl=false&fromsearch=true&labelwords=&suginput="
    url_parse = ""
    jobs_total = parselistlinks(url_start, url_parse)
    now = datetime.now().strftime('%m%d_%H%M%S')
    newsname = 'lagou_sj' + now + '.xlsx'  # name the file by timestamp
    df = pd.DataFrame(jobs_total)
    df.to_excel(newsname)
    print('File saved')

if __name__ == '__main__':
    main()
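
The listing loop reads each response as ['content']['positionResult']['result'] and then takes exactly 15 entries, so a short page or an unexpected response body stops the whole run. Below is a minimal sketch of a more forgiving extraction step. The helper name `extract_results` and the sample payload are made up for illustration; only the three nested keys come from the script.

```python
import json


def extract_results(text):
    """Return the list of job dicts found in a response body, or [] if absent."""
    try:
        payload = json.loads(text)
    except ValueError:  # the body was not valid JSON
        return []
    node = payload
    for key in ('content', 'positionResult', 'result'):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Hypothetical sample payload, shaped like the structure the script reads
sample = json.dumps({'content': {'positionResult': {'result': [
    {'positionId': 1, 'positionName': 'UI designer'},
    {'positionId': 2, 'positionName': 'Graphic designer'},
]}}})

# Iterate over however many entries the page actually holds
for job in extract_results(sample):
    print(job['positionId'], job['positionName'])
```

Looping over the returned list, rather than over a fixed range(15), means a page with fewer entries is simply shorter instead of raising an IndexError.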

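Lagou shows 15 listings per page and 30 pages by default, for 450 listings in total. I simply hard-coded those numbers here; change the number of pages to crawl as needed. You can also skip fetching the job requirements, or drop any other fields you don't need. The results are written to an Excel file named lagou_sj plus a timestamp.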
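
One way to avoid hard-coding the page count and the field list is to pass them in as parameters. The sketch below is only an illustration under assumptions, not the structure of the script itself: `collect_jobs`, `fetch_page`, `fetch_need`, and `fake_page` are hypothetical names, and the page fetcher is injected as a function so the loop can be tried on fake data without touching the site.

```python
def collect_jobs(fetch_page, pages=30, fields=('positionName', 'salary', 'city'),
                 fetch_need=None):
    """Gather rows from `pages` result pages, keeping only the keys in `fields`.

    fetch_page(pn) must return the list of job dicts on page pn (hypothetical).
    fetch_need(position_id), if given, returns the job-requirements text.
    """
    jobs = []
    for pn in range(1, pages + 1):
        for jd in fetch_page(pn):
            row = {key: jd.get(key) for key in fields}
            if fetch_need is not None:  # otherwise skip the slow detail-page request
                row['need'] = fetch_need(jd['positionId'])
            jobs.append(row)
    return jobs


# Fake page source standing in for the real request: 15 made-up rows per page
def fake_page(pn):
    return [{'positionId': pn * 100 + i, 'positionName': 'designer',
             'salary': '10k-15k', 'city': 'Chengdu'} for i in range(15)]


rows = collect_jobs(fake_page, pages=2)
print(len(rows))  # 2 pages x 15 rows
```

Leaving `fetch_need` unset drops the per-position detail request, which is the slowest part of the crawl because each one is followed by a pause.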
