Out of idle curiosity I wrote a crawler for NetEase (网易) news. The main work is analysing the pages: I used a packet-capture tool to examine each link on the page in detail. The data is stored in SQLite. The script below only does a simple extraction of the text of each news page and performs no further processing of the information.
For reference only; corrections are welcome.
#coding:utf-8
# Python 2 script
import random
import re
import sqlite3
import json
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 only
import uuid
import requests
session = requests.session()

def md5(s):
    # md5 hash of the article link, used as a unique key for the record
    import hashlib
    m = hashlib.md5()
    m.update(s)
    return m.hexdigest()
def wangyi():
    # list pages are paginated as name.js, name_02.js, name_03.js, ...
    for i in range(1, 3):
        if i == 1:
            k = ""
        else:
            k = "_0" + str(i)
        # NOTE: the base URL was elided in the original post
        url = "" + k + ".js?callback=data_callback"
        print url
        headers = {}  # the request headers were elided in the original post
        result = session.get(url=url, headers=headers).text
        try:
            # the response is JSONP: data_callback([...]); strip the wrapper
            # and evaluate what remains as a Python/JS array literal
            result1 = eval(eval((json.dumps(result)).replace('data_callback(', '').replace(')', '').replace('\n', '')))
        except:
            pass
        try:
            for i in result1:  # note: shadows the outer loop variable
                tlink = i['tlink']
                headers2 = {}  # elided in the original post
                print "tlinktlinktlinktlink", tlink
                return_data = session.get(url=tlink, headers=headers2).text
                try:
                    soup = BeautifulSoup(return_data, 'html.parser')
                    # the attrs dict selecting the article-body div was elided
                    returnsoup = soup.find_all("div", attrs={})[0]
                    print returnsoup
                    print "******************************"
                    try:
                        # first extraction regex (pattern elided in the original post)
                        returnlist = re.findall('', str(returnsoup))
                        content1 = '<-->'.join(returnlist)
                    except:
                        content1 = ""
                    try:
                        # second extraction regex (only the capture group survived)
                        returnlist1 = re.findall('(.*?)', str(returnsoup))
                        content2 = '<-->'.join(returnlist1)
                    except:
                        content2 = ""
                    content = content1 + content2
                except:
                    content = ""
                cx = sqlite3.connect("c:\\users\\xuchunlin\\pycharmprojects\\study\\db.sqlite3", check_same_thread=False)
                cx.text_factory = str
                try:
                    print "inserting data for link %s" % (url)
                    tlink = i['tlink']
                    title = (i['title']).decode('unicode_escape')
                    commenturl = i['commenturl']
                    tienum = i['tienum']
                    opentime = i['time']
                    print title, tlink, commenturl, tienum, opentime, content
                    url2 = md5(str(tlink))
                    cx.execute("insert into wangyi (title,tlink,commenturl,tienum,opentime,content,url) values (?,?,?,?,?,?,?)",
                               (str(title), str(tlink), str(commenturl), str(tienum), str(opentime), str(content), str(url2)))
                except Exception as e:
                    print e
                    print "insert failed"
                cx.commit()
                cx.close()
        except:
            pass
wangyi()