A Simple Python Crawler

The data being scraped: the Douban Movie Top 250 list.

The Python libraries used are requests, BeautifulSoup (from bs4), and pandas.

requests fetches the page data, BeautifulSoup parses it, and pandas saves it in Excel or CSV format.
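As a rough illustration of the fetching step, the sketch below builds the ten page URLs (offsets 0, 25, ..., 225, as in the listing that follows) and fetches a page with a browser-like header. The base URL, the User-Agent string, and the helper names `page_urls` and `fetch_html` are assumptions made for illustration; the original post leaves its URL and header values blank.

```python
import requests

# Hypothetical base URL: the original post leaves its URL blank.
BASE_URL = "https://example.com/top250?start="

# Hypothetical browser-like header; any real browser User-Agent string would do.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def page_urls(base_url, pages=10, page_size=25):
    """Return one URL per result page: offsets 0, 25, 50, ..."""
    return [base_url + str(n * page_size) for n in range(pages)]


def fetch_html(url, session=None):
    """Return the page text, or None if the request fails."""
    client = session or requests  # a Session-like object can be passed in
    try:
        response = client.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()  # raise on 4xx/5xx status codes
        return response.text
    except requests.RequestException:
        return None


if __name__ == "__main__":
    for url in page_urls(BASE_URL):
        print(url)
```

Returning `None` on failure, rather than an error string, lets the caller skip a failed page instead of handing the error text to the parser.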

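import requests  # fetch the pages
from bs4 import BeautifulSoup  # parse the HTML
import pandas as pd  # save the data

# Main function: ties the other functions together
def main():
    url = ""
    html = geturldata(url)
    gethtmldata(html, url)

# Fetch the page data
def geturldata(url):
    try:
        # Some sites reject crawlers, so add request headers to imitate a browser.
        headers = {}  # header values are missing in the source (e.g. a browser User-Agent)
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        html = r.text
        return html
    except requests.RequestException:
        return 'An error occurred'

# Parse the page data
def gethtmldata(html, url):
    # Initial values are missing in the source; a dict and empty lists are assumed.
    alldata = {}
    em_data = []
    name_data = []
    name_other_data = []
    quote_data = []
    star_data = []
    href_data = []
    for page in range(0, 10):
        baseurl = url + str(page * 25)
        html = geturldata(baseurl)
        soup = BeautifulSoup(html, 'html.parser')
        body = soup.body
        # Everything to scrape sits inside the div elements with class=item
        item = body.find_all('div', class_='item')
        for i in range(len(item)):
            # The class filters for the span lookups below are missing in the source.
            # Movie rank
            em = item[i].find('em')
            # Movie title
            name = item[i].find('span',)
            # Alternative title
            name_o = item[i].find('span',)
            # Short review
            quote = item[i].find('span',)
            if quote is None:  # some movies have no short review
                pass  # branch body missing in the source
            else:
                pass  # branch body missing in the source
            # Rating
            star = item[i].find('span',)
            # Link
            href = item[i].find('a')
    # Save the data
    df = pd.DataFrame()  # constructor arguments are missing in the source
    df = df.set_index('排名')  # use the rank (排名) column as the index
    # Recent pandas versions no longer write .xls or accept encoding=
    df.to_excel("movietop250.xlsx")

if __name__ == "__main__":
    main()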
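The parsing loop above has lost its selectors and branch bodies, so here is a minimal sketch of the same idea run against a small hand-written HTML fragment rather than the live site. The class names `title`, `inq`, and `rating_num`, the sample markup, and the helper name `parse_items` are all assumptions for illustration; only the `div` with `class=item` and the "some movies have no short review" case come from the listing.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; the class names on the
# span elements are assumptions, not taken from the original post.
SAMPLE_HTML = """
<html><body>
<div class="item"><em>1</em><a href="https://example.com/a">
<span class="title">Example Film A</span></a>
<span class="rating_num">9.0</span><span class="inq">A short review.</span></div>
<div class="item"><em>2</em><a href="https://example.com/b">
<span class="title">Example Film B</span></a>
<span class="rating_num">8.5</span></div>
</body></html>
"""


def parse_items(html):
    """Return one dict per div.item, with an empty quote when none exists."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="item"):
        quote = item.find("span", class_="inq")  # None when there is no review
        rows.append({
            "rank": item.find("em").get_text(strip=True),
            "name": item.find("span", class_="title").get_text(strip=True),
            "quote": quote.get_text(strip=True) if quote is not None else "",
            "star": item.find("span", class_="rating_num").get_text(strip=True),
            "href": item.find("a")["href"],
        })
    return rows


if __name__ == "__main__":
    for row in parse_items(SAMPLE_HTML):
        print(row)
```

Collecting one dict per movie keeps the fields of each entry together, which avoids the parallel lists going out of step when a field such as the short review is absent.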

Once the crawl succeeds, the collected data is written out as a spreadsheet.
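The saving step might look roughly like the sketch below, which builds a DataFrame from row dicts, uses the rank column as the index, and serializes it to CSV. The rows are placeholders rather than real scraped results, and the column names and the helpers `build_frame` and `to_csv_text` are hypothetical.

```python
import io

import pandas as pd

# Placeholder rows for illustration only; they are not real scraped results.
ROWS = [
    {"rank": 1, "name": "Example Film A", "star": 9.0},
    {"rank": 2, "name": "Example Film B", "star": 8.5},
]


def build_frame(rows):
    """Build a DataFrame from row dicts and index it by the rank column."""
    return pd.DataFrame(rows).set_index("rank")


def to_csv_text(df):
    """Serialize the frame to CSV text in memory."""
    buffer = io.StringIO()
    df.to_csv(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv_text(build_frame(ROWS)))
```

To write a file instead, pass a path to `df.to_csv(...)` or `df.to_excel(...)`; writing `.xlsx` files requires an Excel engine such as openpyxl to be installed.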
