Scraping data from Baidu Tieba with Python (improved version)

#coding=utf-8

import urllib2
import urllib
import json
from lxml import etree
import chardet  # Not called below, but widely used in practice to detect a string's encoding

page = open('python_zongjie.json', 'w')

# The URL was left empty in the original post; put the Tieba page to fetch here
request = urllib2.Request('')
response = urllib2.urlopen(request)
rehtml = response.read().decode('utf-8')
html = etree.HTML(rehtml)

# Pay close attention to the spaces inside the class values;
# the Chrome extension "XPath Helper" is handy for checking them
result = html.xpath('//li[@class=" j_thread_list clearfix"]')

item = {}
for site in result:
    # chardet.detect(etree.tostring(site))  # use this to check the string's encoding
    item['title_name'] = site.xpath('.//a[@class="j_th_tit "]')[0].text
    item['author-name'] = site.xpath('.//a[@class="frs-author-name j_user_card "]')[0].text
    item['author-id'] = site.xpath('.//a[@class="frs-author-name j_user_card "]')[1].text
    item['reply_date'] = site.xpath('.//span[@class="threadlist_reply_date pull_right j_reply_data"]')[0].text.strip()
    item['rep_num'] = site.xpath('.//span[@class="threadlist_rep_num center_text"]')[0].text

    # Write one JSON object per line, keeping Chinese text readable instead of \u escapes
    outitem = json.dumps(item, ensure_ascii=False)
    page.write(outitem.encode('utf-8') + '\n')

page.close()
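
The warning about spaces deserves a small illustration. An XPath test such as `@class=" j_thread_list clearfix"` compares the whole attribute string, so one missing or extra space makes the match fail. The sketch below is only an illustration under stated assumptions: the markup is made up (loosely modelled on the class names used above), the helper `has_class` is hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml so that the snippet runs without third-party packages.

```python
import xml.etree.ElementTree as ET

# Made-up markup: the two <li> elements differ only in stray spaces.
SAMPLE = (
    '<ul>'
    '<li class=" j_thread_list clearfix"><a class="j_th_tit ">first</a></li>'
    '<li class="j_thread_list clearfix"><a class="j_th_tit">second</a></li>'
    '</ul>'
)

root = ET.fromstring(SAMPLE)

# Exact attribute comparison: only the <li> whose class string carries
# the same leading space is selected.
exact = root.findall('.//li[@class=" j_thread_list clearfix"]')


def has_class(element, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get('class', '').split()


# Token-based comparison: stray spaces no longer matter.
tolerant = [li for li in root.iter('li') if has_class(li, 'j_thread_list')]

exact_titles = [li.find('a').text for li in exact]
tolerant_titles = [li.find('a').text for li in tolerant]
print(exact_titles)
print(tolerant_titles)
```

With lxml, a comparable space-tolerant test can be written in plain XPath 1.0 as `contains(concat(" ", normalize-space(@class), " "), " j_thread_list ")`, at the cost of a longer expression.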

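
The script above targets Python 2 (`urllib2`, and an explicit `.encode('utf-8')` before writing). For readers on Python 3, the output step might look like the following sketch. The records are made-up stand-ins for the scraped `item` dicts and `dump_json_lines` is a hypothetical helper name; only `json.dumps(..., ensure_ascii=False)` comes from the original.

```python
import io
import json

# Made-up records standing in for the scraped `item` dicts.
items = [
    {'title_name': '測試帖子', 'rep_num': '12'},
    {'title_name': 'hello', 'rep_num': '3'},
]


def dump_json_lines(records, stream):
    """Write one JSON object per line, leaving non-ASCII text unescaped."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + '\n')


# An in-memory text stream keeps the sketch free of file-system side effects.
buffer = io.StringIO()
dump_json_lines(items, buffer)

lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]
print(lines[0])
```

To write to disk instead, the stream could be opened with `open('python_zongjie.json', 'w', encoding='utf-8')`; in Python 3 the text layer does the encoding, so no manual `.encode('utf-8')` call is needed.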