Python Crawler Example (4): Scraping NetEase News

2021-09-07 12:37:14 · 3,142 characters · 8,385 reads

I scraped NetEase news out of idle curiosity. The main work is analyzing the site: I used a packet-capture tool to examine each request in detail. The results are stored in SQLite. The script below only extracts the plain text of each news page; it does no further parsing of that text.

For reference only; corrections are welcome.
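The crawler inserts each article into a SQLite table named `wangyi`, but the post never shows the table being created. A minimal schema sketch matching the insert statement in the script (column names are taken from the code; the text types and the primary-key choice are my assumptions):

```python
import sqlite3

# in-memory DB for illustration; the post uses a file under c:\users\...
cx = sqlite3.connect(":memory:")
cx.execute("""
    create table if not exists wangyi (
        title      text,
        tlink      text,
        commenturl text,
        tienum     text,
        opentime   text,
        content    text,
        url        text primary key  -- md5 of tlink; assumed to de-duplicate articles
    )
""")
cx.execute("insert into wangyi values (?,?,?,?,?,?,?)",
           ("title", "link", "commenturl", "0", "now", "body", "abc123"))
cx.commit()
print(cx.execute("select count(*) from wangyi").fetchone()[0])
```

With `url` as the primary key, inserting the same article twice raises `sqlite3.IntegrityError` instead of storing a duplicate row.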

# coding:utf-8
import random, re
import sqlite3
import json
from bs4 import BeautifulSoup
import sys
reload(sys)                      # Python 2 only
sys.setdefaultencoding('utf-8')  # force UTF-8 as the default codec
import uuid
import requests

session = requests.session()
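The headers dicts were lost when the post was republished, so as shown the requests would go out with the `requests` defaults. A minimal sketch of attaching a browser-style User-Agent to the shared session once, so every `session.get()` sends it (the header value is my stand-in, not the author's original):

```python
import requests

session = requests.session()
# stand-in User-Agent; the original headers were lost from the post
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})
# every session.get() now sends this header and reuses the pooled connection
print(session.headers["User-Agent"])
```

Using one `Session` also keeps cookies and TCP connections across the list-page and article-page requests.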

def md5(s):
    # hash a string (here: the article URL) into a fixed-length hex key
    import hashlib
    m = hashlib.md5()
    m.update(s)
    return m.hexdigest()
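The `md5` helper above turns each article URL into a fixed-length key that the script stores in the `url` column. The same idea in Python 3, where the bytes encoding must be explicit (function name is mine):

```python
import hashlib

def md5_key(s):
    # Python 3 version of the helper: hashlib.md5 wants bytes, not str
    return hashlib.md5(s.encode('utf-8')).hexdigest()

print(md5_key("abc"))  # 900150983cd24fb0d6963f7d28e17f72
```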

def wangyi():
    for i in range(1, 3):
        if i == 1:
            k = ""
        else:
            k = "_0" + str(i)
        # NOTE: the base of the list-page URL was stripped when the post was
        # republished; only the "" + k + ".js?callback=data_callback" tail survives
        url = "" + k + ".js?callback=data_callback"
        print url
        headers = {}  # the original request headers were also lost
        result = session.get(url=url, headers=headers).text
        try:
            # the response is JSONP: strip the data_callback( ... ) wrapper,
            # then eval the remaining literal into Python objects
            result1 = eval(eval((json.dumps(result)).replace('data_callback(', '').replace(')', '').replace('\n', '')))
        except:
            pass
        try:
            for i in result1:
                tlink = i['tlink']
                headers2 = {}  # lost in the original post
                print "tlink:", tlink
                return_data = session.get(url=tlink, headers=headers2).text
                try:
                    soup = BeautifulSoup(return_data, 'html.parser')
                    # the attrs filter was stripped from the original post
                    returnsoup = soup.find_all("div", attrs={})[0]
                    print returnsoup
                    print "******************************"
                    try:
                        # both regex patterns were stripped from the original post
                        returnlist = re.findall('', str(returnsoup))
                        content1 = '<-->'.join(returnlist)
                    except:
                        content1 = ""
                    try:
                        returnlist1 = re.findall('(.*?)', str(returnsoup))
                        content2 = '<-->'.join(returnlist1)
                    except:
                        content2 = ""
                    content = content1 + content2
                except:
                    content = ""
                cx = sqlite3.connect("c:\\users\\xuchunlin\\pycharmprojects\\study\\db.sqlite3",
                                     check_same_thread=False)
                cx.text_factory = str
                try:
                    print "inserting data for link %s" % (url)
                    tlink = i['tlink']
                    title = (i['title']).decode('unicode_escape')
                    commenturl = i['commenturl']
                    tienum = i['tienum']
                    opentime = i['time']
                    print title
                    print tlink
                    print commenturl
                    print tienum
                    print opentime
                    print content
                    url2 = md5(str(tlink))
                    cx.execute("insert into wangyi (title,tlink,commenturl,tienum,opentime,content,url) values (?,?,?,?,?,?,?)",
                               (str(title), str(tlink), str(commenturl), str(tienum), str(opentime), str(content), str(url2)))
                except Exception as e:
                    print e
                    print "insert failed"
                cx.commit()
                cx.close()
        except:
            pass

wangyi()
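The double `eval` in the script is its most fragile part: it executes whatever the server returns. When the JSONP payload happens to be valid JSON, the wrapper can instead be stripped with a regex and parsed with `json.loads` — a safer sketch (the function name and sample payload are mine, not from the post; the real NetEase payload may use unquoted keys, which is presumably why the author fell back to `eval`):

```python
import json
import re

def parse_jsonp(text):
    # strip the data_callback( ... ) wrapper, keep the payload, parse as JSON
    m = re.match(r'^\s*data_callback\((.*)\)\s*;?\s*$', text, re.S)
    payload = m.group(1) if m else text
    return json.loads(payload)

sample = 'data_callback([{"title": "demo", "tlink": "https://news.163.com/a.html"}])'
items = parse_jsonp(sample)
print(items[0]["tlink"])  # https://news.163.com/a.html
```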
