A Simple Crawler for the Douban Movie Top 250

This crawler is about as simple as a crawler can be: it scrapes the content of the Douban Top 250 movie pages and then imports that content into a database.

import time

import requests
from bs4 import BeautifulSoup

# Assumed request header; the listing uses `headers` without defining it.
headers = {'User-Agent': 'Mozilla/5.0'}


def getlist(listurl, result):
    time.sleep(2)  # pause between requests
    res = requests.get(listurl, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    movielist = soup.select('.grid_view li')
    for m in movielist:
        rank = m.select('em')[0].text
        # The second .title element, when present, holds the foreign-language name.
        if len(m.select('.title')) > 1:
            english_name = m.select('.title')[1].text.strip().strip('/').strip()
        else:
            english_name = "no info"
        chinese_name = m.select('.title')[0].text
        info_str = m.select('.info .bd p')[0].text.strip().replace(u'\xa0', u' ')
        info_list = info_str.split('\n')
        # Second line of the info block: year / place / type
        time_list = info_list[1].strip().split('/')
        movie_time = time_list[0].strip()
        movie_place = time_list[1].strip()
        movie_type = time_list[2].strip()
        # First line of the info block: director, then main actors
        director_list = info_list[0].strip(u'導演:').split('  ')
        director = director_list[0].strip()
        if len(director_list) > 1:
            main_actor = director_list[1].strip().strip(u"主演:").strip()
        else:
            main_actor = u"暫無資訊"
        if m.select('.inq'):
            comments = m.select('.inq')[0].text.strip()
        else:
            comments = 'none'
        data_movies = (rank, chinese_name, english_name, director, main_actor, movie_time,
                       movie_place, movie_type, comments)
        result.append(data_movies)  # row destined for the database
    # Follow the "next page" link until there is none left.
    # `lurl` and `movie` are module-level names defined in the database script.
    if soup.select(u'.next a'):
        asoup = soup.select(u'.next a')[0][u'href']
        next_page = lurl + asoup
        getlist(next_page, result)
    else:
        print('done')
    return result, movie

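The returned result and movie are both lists: result holds the rows to be stored in the database, and movie holds the content to be written to a file. They are kept separate because every element written to the file needs a descriptive label such as "Director" to be understandable, whereas the database already has column names and does not need such labels.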
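As a minimal sketch of that split, the same tuple can stay unlabelled for the database and be turned into a labelled line for the text file. The `LABELS` tuple and the `to_labelled_line` helper are hypothetical names introduced here for illustration, and the sample row is made up for the example; none of this appears in the original script.

```python
# Hypothetical labels, one per field of the data_movies tuple, in the same order.
LABELS = ("Rank", "Chinese name", "English name", "Director", "Main actors",
          "Year", "Place", "Type", "Comment")


def to_labelled_line(row):
    """Turn one unlabelled database row into a labelled line for a text file."""
    return " | ".join("{}: {}".format(label, value)
                      for label, value in zip(LABELS, row))


# Sample row for illustration only; real rows come from getlist().
sample_row = ("1", "sample title", "Sample Title", "some director",
              "some actor", "1994", "some place", "drama", "none")
print(to_labelled_line(sample_row))
```

The database row keeps its plain tuple form, so it can be passed to `executemany` unchanged, while the file gets the readable version.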

import MySQLdb

# Connect to the database
db = MySQLdb.connect(host="localhost", user="root", passwd="", db="spider", use_unicode=True, charset="utf8")
cursor = db.cursor()
cursor.execute("drop table if exists movie")
# `rank` is backquoted because RANK is a reserved word in MySQL 8.0 and later.
sql = """create table movie (
    `rank` int(4),
    chinese_name char(100),
    english_name char(100),
    director char(100),
    main_actors char(100),
    time char(100),
    place char(100),
    type char(100),
    comment char(100) )"""
cursor.execute(sql)
lurl = ''  # base URL of the Top 250 list page (left empty in the source)
movie = []
result = []
result, movies = getlist(lurl, result)
print(len(result))
# Insert the scraped rows into the database
cursor.executemany(
    """insert into movie (`rank`, chinese_name, english_name, director, main_actors, time, place, type, comment)
    values (%s, %s, %s, %s, %s, %s, %s, %s, %s)""",
    result
)
db.commit()  # the database only changes after commit
cursor.close()
db.close()

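Crawler part: because what each page displays varies widely, the crawler should not process the content in too much detail at the start.

Overly fine-grained processing causes small problems to surface, and those can stop the crawler from running properly.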
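One way to keep detailed parsing from breaking the crawl is to pad or fold fields instead of indexing blindly. The following is only a sketch under the assumption that the "year / place / type" line may have fewer or more than three parts; `split_info` is a hypothetical helper, not part of the original script.

```python
def split_info(line, expected=3, default="no info"):
    """Split a 'year / place / type' line without ever raising IndexError."""
    parts = [p.strip() for p in line.split("/")]
    # Pad short lists so that every expected field exists.
    parts += [default] * (expected - len(parts))
    # Fold any extra fields into the last one rather than dropping them.
    head, tail = parts[:expected - 1], parts[expected - 1:]
    return head + [" / ".join(tail)]


movie_time, movie_place, movie_type = split_info("1994 / somewhere")
print(movie_time, movie_place, movie_type)
```

With a helper like this, an unusual entry yields a placeholder value instead of an exception that ends the whole run.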

Database import part: most errors during the database import are encoding errors, so that is the main thing to watch for.
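One concrete source of such errors, offered here as background rather than something the original post states: MySQL's `utf8` charset stores at most three bytes per character, so characters that take four bytes in UTF-8 (emoji, for example) need `utf8mb4` instead. A minimal sketch for checking scraped rows before inserting them follows; `needs_utf8mb4` and `rows_needing_utf8mb4` are hypothetical helper names.

```python
def needs_utf8mb4(text):
    """Return True if any character in text takes four bytes in UTF-8."""
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


def rows_needing_utf8mb4(rows):
    """Return the rows that contain at least one four-byte character."""
    return [row for row in rows
            if any(needs_utf8mb4(str(field)) for field in row)]


rows = [("1", "plain title"), ("2", "title with emoji \U0001F600")]
print(len(rows_needing_utf8mb4(rows)))
```

If any rows are flagged, one option is to create the table and open the connection with `charset="utf8mb4"` instead of `utf8`.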
