Crawler architecture
Execution flow
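The architecture above boils down to a loop: a URL manager hands pending URLs to a downloader, a parser extracts new links, and unseen links go back into the queue. A minimal sketch of that loop, with `fetch` and `parse` as stand-ins (assumptions, not a real library API — in a real crawler they would be urllib/requests and BeautifulSoup):

```python
def fetch(url, pages):
    """Downloader: return the page body for a URL (here, from a fake in-memory site)."""
    return pages.get(url, "")

def parse(body):
    """Parser: return the outgoing links found in a page body."""
    return [tok for tok in body.split() if tok.startswith("/")]

def crawl(seed, pages):
    """URL manager: track seen/pending URLs and drive the fetch-parse loop."""
    seen, pending, order = {seed}, [seed], []
    while pending:
        url = pending.pop(0)          # take the next pending URL
        order.append(url)
        for link in parse(fetch(url, pages)):
            if link not in seen:      # only enqueue URLs we have not visited
                seen.add(link)
                pending.append(link)
    return order

# A tiny fake site to exercise the loop
site = {
    "/a": "text /b /c",
    "/b": "text /c",
    "/c": "done",
}
print(crawl("/a", site))  # breadth-first order: ['/a', '/b', '/c']
```

Popping from the front of `pending` gives breadth-first order; popping from the back would make the crawl depth-first.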
HTML parser
HTML parser: BeautifulSoup syntax
Simple parsing example 1
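The code for example 1 did not survive in this copy; a minimal parse along the lines the heading suggests might look like this (the HTML string here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Minimal use of BeautifulSoup: pass markup and a parser name
soup = BeautifulSoup("<html><body><h1>Hello</h1></body></html>", "html.parser")
print(soup.h1.string)  # Hello
```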
Simple parsing example 2

from bs4 import BeautifulSoup
import re

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print('Get all the links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the link for lacie')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regular-expression match')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the text of the p paragraph')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())
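The same queries can also be written with CSS selectors through `soup.select`. A short sketch using the same "Dormouse's story" document as above:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# CSS selectors: by tag, by class, by id, and by attribute pattern
print(soup.select("title"))             # all <title> tags
print(soup.select("p.title"))           # <p> tags with class "title"
print(soup.select("#link1"))            # the tag with id "link1"
print(soup.select('a[href$="lacie"]'))  # <a> tags whose href ends with "lacie"
```

`select` always returns a list, whereas `find` returns a single tag or `None`.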
from bs4 import BeautifulSoup as bs
import re

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse with the html.parser parser
soup = bs(html_doc, "html.parser")
print(soup.prettify())
# Get the title tag and its contents
print(soup.title)
# Get the text inside the title tag
print(soup.title.string)
# Get the name of the parent tag
print(soup.title.parent.name)
# Get the first p tag and its contents
print(soup.p)
# Get the value of the p tag's class attribute
print(soup.p['class'])
# Get the first a tag and its contents
print(soup.a)
'''soup.tag only returns the first of all matching tags'''
# Get all a tags and their contents
print(soup.find_all('a'))
# Get the tag containing the link1 element and its contents
print(soup.find(id='link1'))
# Get the text of the tag containing the link1 element
print(soup.find(id='link1').string)
# Get the link and the text of every a tag
for link in soup.find_all('a'):
    print('Link: ' + link.get('href') + ' text: ' + link.string)
# Get the p tag whose class value is story, with all its tags and contents
print(soup.find("p", class_="story"))
# Get all the text under the p tag whose class value is story
print(soup.find("p", class_="story").get_text())
# Get tags whose names start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# Get all a tags whose href contains a given substring, e.g.:
print(soup.find_all('a', href=re.compile("ie")))
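Besides searching, BeautifulSoup can also navigate from a found tag to its parent and siblings. A small sketch (the one-line document here is made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = ('<p class="story"><a id="link1">Elsie</a>, '
            '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(id="link1")
# Walk up to the enclosing tag
print(first.parent.name)                    # p
# Walk sideways to the next a tag at the same level
print(first.find_next_sibling("a").string)  # Lacie
# Collect the text of every a under the p tag
print([a.string for a in soup.p.find_all("a")])
```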
Comprehensive example: scraping Wikipedia entries
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

resp = urlopen("").read().decode('utf-8')
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Find links that start with /wiki/
listurls = soup.find_all("a", href=re.compile("^/wiki/"))
# Print the name and URL of every entry
for url in listurls:
    # Filter out links ending in .jpg or .JPG
    if not re.search(r"\.(jpg|JPG)$", url['href']):
        # Print the link text and the corresponding link
        print(url.get_text() + ' ' + url['href'])
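The target URL is omitted in this copy, but the filtering logic can be checked offline against a small inline snippet (a sketch; the hrefs below are made up to mimic Wikipedia's link patterns):

```python
from bs4 import BeautifulSoup
import re

# Inline HTML standing in for the downloaded page
resp = """
<a href="/wiki/Python">Python</a>
<a href="/wiki/File:Logo.jpg">Logo</a>
<a href="/w/index.php">edit</a>
"""
soup = BeautifulSoup(resp, "html.parser")

entries = []
for url in soup.find_all("a", href=re.compile(r"^/wiki/")):
    if not re.search(r"\.(jpg|JPG)$", url["href"]):  # drop image links
        entries.append((url.get_text(), url["href"]))
print(entries)  # [('Python', '/wiki/Python')]
```

Only the first link survives: the second matches `^/wiki/` but ends in `.jpg`, and the third does not start with `/wiki/` at all.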