Scraping Qiushibaike with selenium

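Background:

I recently came across an introduction to selenium, which said it can open web pages automatically, the way a person would.

That caught my interest, so I spent some time learning it.

Sure enough, interest is the best teacher.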

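Notes:

I picked Qiushibaike because it has no robots rules set for crawlers, which makes it a good site to practice on.

Please do not crawl it maliciously.

The code is as follows: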

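#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from pymongo import MongoClient

"""
1. To get a single tag, use: element
2. To get multiple tags, use: elements
"""
"""
Get a tag's text: text
Get the href attribute value: get_attribute("href")
"""


def get_db():
    # Connect to the local MongoDB instance and return the target collection.
    client = MongoClient(host="localhost", port=27017)
    db = client.spider
    collection = db.qiushibaike_selenium
    return collection


def get_text():
    # Collect every post block on the current page.
    content_list = driver.find_elements(By.CLASS_NAME, "main-list")
    # print(content_list)
    collection = get_db()
    for item in content_list:
        tm = item.find_element(By.CLASS_NAME, "fr").text
        title = item.find_element(By.CLASS_NAME, "title").text
        link = (
            item.find_element(By.CLASS_NAME, "title")
            .find_element(By.TAG_NAME, "a")
            .get_attribute("href")
        )
        text = item.find_element(By.CLASS_NAME, "content").text
        url = driver.current_url
        # The dict literal is missing in the original; the keys below are
        # assumed and simply mirror the variables collected above.
        out_dict = {"tm": tm, "title": title, "link": link, "text": text, "url": url}
        print("\033[31mWriting this joke to the database\033[0m")
        collection.insert_one(out_dict)
        # print(out_dict)


def get_next():
    print("")
    try:
        # Click the "next" button; fails when there is no further page.
        next_page = driver.find_element(By.CLASS_NAME, "next")
        next_page.click()
        return True
    except Exception as e:
        print("This is the last page")
        return False


if __name__ == "__main__":
    driver = webdriver.Firefox()
    # The start URL is blank in the original; put the listing page URL here.
    driver.get("")
    get_text()
    time.sleep(2)
    while get_next():
        get_text()
        time.sleep(5)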

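What you need to know:

1. Connecting to a MongoDB database and inserting data. If you have no background in this, you can save the scraped results to a text file instead.

2. How selenium locates elements. This takes some HTML and CSS basics; if you have none at all, see the small tips below.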
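
For readers who take the text-file route from point 1, here is a minimal sketch of one way to do it. It assumes each scraped record is a plain dict. The names `save_item` and `load_items`, the file name `qiushibaike.jsonl`, and the one-JSON-object-per-line layout are illustrative choices, not part of the original script.

```python
import json


def save_item(item, path="qiushibaike.jsonl"):
    """Append one scraped record to a text file as a single JSON line."""
    # ensure_ascii=False keeps Chinese text readable in the file.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_items(path="qiushibaike.jsonl"):
    """Read every saved record back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With something like this in place, the `collection.insert_one(out_dict)` call could be swapped for `save_item(out_dict)`, and no database is needed.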

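Small tips:

1. How to locate an element:

Find the element you need on the page, then right-click -- Inspect Element -- Copy -- XPath.

2. When scraping, remember to set a sleep time between pages. This reduces the load on the server and also cuts down on errors caused by pages that have not finished rendering.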
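
The sleep in tip 2 can be wrapped in a small helper. This is a rough sketch, not the original script's approach: the name `polite_pause`, the random jitter, and the default values are assumptions added for illustration. The original simply calls `time.sleep` with a fixed number of seconds.

```python
import random
import time


def polite_pause(base=2.0, jitter=1.0, sleep=time.sleep):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds."""
    # A slightly varied delay spaces out requests to the site.
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling `polite_pause()` after each page turn would take the place of the fixed `time.sleep(5)`. The `sleep` parameter exists only so the helper can be tried without actually waiting.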
