A Small Web Scraper (Part 2)

2021-10-01 08:14:17

For a course project, I did a quick scrape of some historical information from a history ** site.

Clicking each link on a listing page opens a detail page with the full information.
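
The listing-to-detail step can be sketched roughly as follows. This is only an illustration under assumptions: the `example.com` address, the `LISTING_PATTERN` name, and the sample `href` values are hypothetical, since the real site address is not given in this post. It shows how 30 listing page URLs can be generated and how the links found on a page can be resolved into absolute detail page URLs with the standard library.

```python
from urllib.parse import urljoin

# Hypothetical listing URL pattern; the real site address is not given in this post.
LISTING_PATTERN = "https://example.com/list/{}"


def listing_pages(first=1, last=30):
    """Return the listing page URLs for pages first..last (inclusive)."""
    return [LISTING_PATTERN.format(i) for i in range(first, last + 1)]


def detail_urls(page_url, hrefs):
    """Resolve the href values found on one listing page into absolute URLs,
    dropping duplicates while keeping their original order."""
    seen = set()
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


pages = listing_pages()
# Hypothetical href values, standing in for what an XPath query might return.
links = detail_urls(pages[0], ["/view/1.html", "view/2.html", "/view/1.html"])
print(len(pages))
print(links)
```

Resolving against the page URL matters when a site uses relative links, because `requests.get` needs an absolute URL.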

Scrape the details and write them to a CSV file.
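
As a small sketch of the CSV step on its own, the example below writes two made-up rows and reads them back. The sample rows and the temporary directory are assumptions for illustration; only the column names and the `utf-8-sig` / `newline=''` settings come from this post's code.

```python
import csv
import os
import tempfile

# Hypothetical sample rows in the post's column order: title, time, text, image_url.
rows = [
    ["Sample title A", "1900-01-01", "First sample text, with a comma", "https://example.com/a.jpg"],
    ["Sample title B", "1800-01-01", "Second sample text", ""],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "history.csv")

    # utf-8-sig writes a byte order mark, which helps spreadsheet programs
    # such as Excel detect UTF-8; newline='' lets the csv module control
    # line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "time", "text", "image_url"])
        writer.writerows(rows)

    # Reading with utf-8-sig strips the byte order mark again.
    with open(path, encoding="utf-8-sig", newline="") as f:
        loaded = list(csv.DictReader(f))

print(loaded[0]["title"])
```

The `csv` module quotes fields that contain commas, so scraped body text does not need to be escaped by hand.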

Since this site has almost no anti-scraping measures, it is easy to do.

import requests
from lxml import etree
import csv
import codecs
import pandas as pd

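def get_url():
    headers = {}  # original header values are missing from this copy; add e.g. a User-Agent
    urls = []
    # The site prefix of this URL is missing from this copy of the post.
    base_url = '12/{}'
    for i in range(1, 31):
        text = requests.get(base_url.format(i), headers=headers).content
        text = text.decode('utf-8')
        html = etree.HTML(text)
        lis = html.xpath('//div[@class="main"]/ul[@class="list clearfix"]/li')
        for li in lis:
            # Assumed loop body (lost in this copy): collect the first link of each list item.
            hrefs = li.xpath('.//a/@href')
            if hrefs:
                urls.append(hrefs[0])
    return urls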

def get_text_tocsv(urls):
    headers = {}  # original header values are missing from this copy of the post
    historys = []
    for url in urls:
        try:
            text = requests.get(url, headers=headers).content.decode('utf-8')
            html = etree.HTML(text)
            title = html.xpath('//div[@class="box main"]/div[@class="view"]/h1/text()')[0].strip()
            time = html.xpath('//div[@class="box main"]/div[@class="view"]/h2/text()')[0].strip()
            content = html.xpath(
                '//div[@class="box main"]/div[@class="view"]/div[@class="post_public content mt5 clearfix"]//text()')
            image = html.xpath(
                '//div[@class="box main"]/div[@class="view"]/div[@class="post_public content mt5 clearfix"]//img/@src')[0]
            content = "".join(content).strip()
            # One row per page, in the same order as the CSV header below.
            history = [title, time, content, image]
            historys.append(history)
            # print(historys)
        except Exception:
            # Assumed handler (lost in this copy): skip pages that fail to download or parse.
            continue
    with open(r'c:\users\admin\desktop\history\history.csv', 'w', encoding='utf-8-sig', newline='') as file:
        csvfile = csv.writer(file)
        csvfile.writerow(['title', 'time', 'text', 'image_url'])
        csvfile.writerows(historys)

def sort_time():
    # Sort the scraped rows by the time column and save them to a second file.
    data = pd.read_csv(r'c:\users\admin\desktop\history\history.csv')
    data = data.sort_values('time')
    print(data.head())
    data.to_csv(r'c:\users\admin\desktop\history\realhistory.csv', encoding='utf-8-sig')

if __name__ == '__main__':
    # sort_time()
    urls = get_url()
    # print(urls)
    get_text_tocsv(urls)
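
One detail of the code above is worth noting: indexing an empty XPath result with `[0]` raises `IndexError`, so a detail page that has no image is dropped completely by the `except` branch. A possible way around this, shown here only as a sketch, is a small helper that falls back to a default value. The name `first_or_default` is hypothetical and not part of lxml; plain lists stand in for real XPath results.

```python
def first_or_default(results, default=""):
    """Return the first XPath result with surrounding whitespace removed,
    or `default` when the result list is empty."""
    return results[0].strip() if results else default


# html.xpath(...) returns a list of strings for text() and @src queries,
# so plain lists stand in for real query results here.
title = first_or_default(["  Sample title \n"])
image = first_or_default([])  # a page without an <img> yields an empty field
row = [title, "", "", image]
print(row)
```

With a helper like this, `image = first_or_default(html.xpath('...//img/@src'))` would keep the row and leave the image column empty instead of skipping the page.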
