20200912 -
Today, while tidying up some of my old code, I came across a Weibo hot-search scraper I wrote quite a while ago, at least two years back. I ran it and, surprisingly, it still works. It is not perfect and the data is a little off, but it is usable. All it does is scrape the hot-search list once a minute and write the results to a log.
#! /bin/python
# coding:utf-8
import random
import re
import sys
import time
import logging
from logging.handlers import TimedRotatingFileHandler

import requests

# A pool of User-Agent strings; one is picked at random so the requests
# do not all look identical.
user_agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

# NOTE: the original url and headers values were stripped when the post was
# rendered. The regex below targets the s.weibo.com hot-search table, so that
# page is assumed here; the headers dict is a reconstruction based on the
# user_agents pool above.
url = "https://s.weibo.com/top/summary"
headers = {"User-Agent": random.choice(user_agents)}
def get_items():
    # Retry until the hot-search page has been fetched.
    while 1:
        try:
            weibo_hot = requests.get(url, headers=headers)
            break
        except requests.exceptions.ConnectionError:
            time.sleep(1)
    # NOTE: the HTML tags inside the original pattern were eaten by the blog's
    # renderer; this is a reconstruction of the hot-search table markup that
    # matches the post-processing of item[1] below.
    pattern = re.compile(
        '<td class="td-01 ranktop">(.*?)</td>.*?'  # rank
        '(<a href=.*?</a>).*?'                     # anchor tag holding the topic title
        '<span>(.*?)</span>',                      # heat value
        re.S)
    items = re.findall(pattern, weibo_hot.text)
    res_list = []
    for item in items:
        # One tab-separated line per topic: time, rank, title, heat.
        res_list.append("\t".join([
            time.strftime("%Y-%m-%d %H:%M:%S"),
            item[0],
            item[1].strip().split(">")[1][:-3],  # drop the surrounding <a ...></a>
            item[2],
        ]))
    return res_list
def main():
    # Log the raw lines only; the handler rotates the file at midnight.
    log_fmt = "%(message)s"
    formatter = logging.Formatter(log_fmt)
    log_file_handler = TimedRotatingFileHandler(filename="data/weibo_hot", when="midnight")
    log_file_handler.setFormatter(formatter)
    log_file_handler.setLevel(level=logging.INFO)
    log = logging.getLogger("data/weibo_hot")
    log.addHandler(log_file_handler)
    log.setLevel(logging.INFO)
    # Scrape once a minute and append every entry to the log.
    while 1:
        for item in get_items():
            log.info(item)
        time.sleep(60)


if __name__ == "__main__":
    main()
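Since every log line is just tab-separated text (timestamp, rank, title, heat), getting the data back out is straightforward. Below is a minimal sketch of a reader, assuming the script above has been running and TimedRotatingFileHandler has produced the active data/weibo_hot file plus its date-suffixed rotations; the load_hot_entries helper is my own name for illustration, not part of the original script.

import glob

def load_hot_entries(pattern="data/weibo_hot*"):
    # Collect the rows written by the scraper, oldest file first.
    entries = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                row = line.rstrip("\n").split("\t")
                if len(row) == 4:  # timestamp, rank, title, heat
                    entries.append(row)
    return entries

if __name__ == "__main__":
    for ts, rank, title, heat in load_hot_entries()[:10]:
        print(ts, rank, title, heat)

Globbing for data/weibo_hot* picks up the rotated files as well as the current one, so the history survives across days without any extra bookkeeping.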