Crawler Series: Scraping cnblogs (博客園) Search Results

The scraped posts are written to a SQL Server database. The code is as follows:

import requests
from lxml import etree
import pymssql
import time

# Connect to the SQL Server database
conn = pymssql.connect(host='127.0.0.1',
                       user='sa',
                       password='root',
                       database='a',
                       charset='utf8')
cursor = conn.cursor()
# Request headers (left empty in this listing)
headers = """"""
# Write rows to the database
def insert_sqlserver(key, data):
    try:
        cursor.executemany(
            "insert into {}(title,contents,create_time,view_count,comment_count,good_count) values(%s,%s,%s,%s,%s,%s)".format(key),
            data
        )
        conn.commit()
    except Exception as e:
        print(e, 'error writing to the database')
# Fetch the data
def get_all(key, url):
    for i in range(1, 51):
        next_url = url + '&pageindex=%s' % i
        res = requests.get(next_url, headers=headers)
        response = etree.HTML(res.text)
        details = response.xpath('//div[@class="searchitem"]')
        data = []
        print(next_url)
        for detail in details:
            try:
                detail_url = detail.xpath('./h3/a[1]/@href')
                good = detail.xpath('./div/span[3]/text()')
                # Fall back to '0' when the comment or view count is missing
                comments = ['0' if not detail.xpath('./div/span[4]/text()') else detail.xpath('./div/span[4]/text()')[0]]
                views = ['0' if not detail.xpath('./div/span[5]/text()') else detail.xpath('./div/span[5]/text()')[0]]
                res = requests.get(detail_url[0], headers=headers)
                response = etree.HTML(res.text)
                title = response.xpath('//a[@id="cb_post_title_url"]/text()')[0]
                contents = response.xpath('//div[@id="post_detail"]') if not response.xpath('//div[@class="postbody"]') else response.xpath('//div[@class="postbody"]')
                content = etree.tounicode(contents[0], method='html')
                create_time = response.xpath('//span[@id="post-date"]/text()')[0]
                print(detail_url[0], good[0], comments[0], views[0], title, create_time)
                # Collect one row in the column order used by insert_sqlserver
                data.append((title, content, create_time, views[0], comments[0], good[0]))
                time.sleep(2)
            except Exception as e:
                print(e, 'error fetching data')
        insert_sqlserver(key, data)
# //*[@id="searchresult"]/div[2]/div[2]/h3/a
# Main function; also (re)creates the data table
def main(key, url):
    cursor.execute(
        """if object_id('%s','u') is not null
            drop table %s
        create table %s(
            id int not null primary key identity(1,1),
            title varchar(500),
            contents text,
            create_time datetime,
            view_count varchar(100),
            comment_count varchar(100),
            good_count varchar(100)
        )""" % (key, key, key))
    conn.commit()
    get_all(key, url)
if __name__ == '__main__':
    key = 'python'
    # The search URL is blank in this listing; it needs a %s placeholder for the keyword
    url = '' % key
    main(key, url)
    conn.close()
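
The scraper repeats one pattern for the comment and view counts: take the first XPath match, or fall back to '0' when nothing matched. A minimal sketch of factoring that into a helper follows. The name `first_or_default` is hypothetical and not part of the original code; it works on any list, such as the list that lxml's `xpath()` returns.

```python
def first_or_default(matches, default='0'):
    """Return the first item of a result list, or `default` if the list is empty."""
    return matches[0] if matches else default


# Plain lists stand in for what detail.xpath(...) would return.
print(first_or_default(['12']))
print(first_or_default([]))
```

Inside `get_all`, a call such as `first_or_default(detail.xpath('./div/span[4]/text()'))` would also avoid evaluating each XPath expression twice.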

Check the database contents:
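
One quick way to check the table from Python is to count the rows that were written. The sketch below is an assumption, not the original author's method: `safe_table_name` and `count_rows` are hypothetical helpers, the cursor can be any DB-API cursor (such as the pymssql `cursor` used by the scraper), and the in-memory `sqlite3` database is only a stand-in so the example runs without a SQL Server instance.

```python
import re
import sqlite3


def safe_table_name(name):
    """Allow only letters, digits and underscores, since the name is interpolated into SQL."""
    if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', name):
        raise ValueError('unsafe table name: %r' % name)
    return name


def count_rows(cursor, table):
    """Return the number of rows in `table` using any DB-API cursor."""
    cursor.execute('select count(*) from {}'.format(safe_table_name(table)))
    return cursor.fetchone()[0]


# Stand-in database; with SQL Server, pass the pymssql cursor instead.
demo = sqlite3.connect(':memory:')
demo_cursor = demo.cursor()
demo_cursor.execute('create table python (title text, view_count text)')
demo_cursor.executemany('insert into python values (?, ?)',
                        [('post one', '10'), ('post two', '25')])
print(count_rows(demo_cursor, 'python'))
```

The name check matters because the scraper builds its SQL by formatting the keyword straight into the statement as a table name, so only plain identifiers should be accepted there.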
