Crawler architecture
Execution flow
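The architecture above boils down to a loop: a URL manager hands pending URLs to a downloader, a parser extracts new links, and unseen links go back into the queue. A minimal sketch of that loop, with `fetch` and `parse` as stand-ins (assumptions, not a real library API — in a real crawler they would be urllib/requests and BeautifulSoup):

```python
def fetch(url, pages):
    """Downloader: return the page body for a URL (here, from a fake in-memory site)."""
    return pages.get(url, "")

def parse(body):
    """Parser: return the outgoing links found in a page body."""
    return [tok for tok in body.split() if tok.startswith("/")]

def crawl(seed, pages):
    """URL manager: track seen/pending URLs and drive the fetch-parse loop."""
    seen, pending, order = {seed}, [seed], []
    while pending:
        url = pending.pop(0)          # take the next pending URL
        order.append(url)
        for link in parse(fetch(url, pages)):
            if link not in seen:      # only enqueue URLs we have not visited
                seen.add(link)
                pending.append(link)
    return order

# A tiny fake site to exercise the loop
site = {
    "/a": "text /b /c",
    "/b": "text /c",
    "/c": "done",
}
print(crawl("/a", site))  # breadth-first order: ['/a', '/b', '/c']
```

Popping from the front of `pending` gives breadth-first order; popping from the back would make the crawl depth-first.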
HTML parser
HTML parser: BeautifulSoup syntax
Simple parsing example 1
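The code for example 1 did not survive in this copy; a minimal parse along the lines the heading suggests might look like this (the HTML string here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Minimal use of BeautifulSoup: pass markup and a parser name
soup = BeautifulSoup("<html><body><h1>Hello</h1></body></html>", "html.parser")
print(soup.h1.string)  # Hello
```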
Simple parsing example 2

from bs4 import BeautifulSoup
import re

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print('Get all the links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the link for lacie')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regular-expression match')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the text of the p paragraph')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())
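The same queries can also be written with CSS selectors through `soup.select`. A short sketch using the same "Dormouse's story" document as above:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# CSS selectors: by tag, by class, by id, and by attribute pattern
print(soup.select("title"))             # all <title> tags
print(soup.select("p.title"))           # <p> tags with class "title"
print(soup.select("#link1"))            # the tag with id "link1"
print(soup.select('a[href$="lacie"]'))  # <a> tags whose href ends with "lacie"
```

`select` always returns a list, whereas `find` returns a single tag or `None`.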
from bs4 import BeautifulSoup as bs
import re

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse with the html.parser parser
soup = bs(html_doc, "html.parser")
print(soup.prettify())
# Get the title tag and its contents
print(soup.title)
# Get the text inside the title tag
print(soup.title.string)
# Get the name of the parent tag
print(soup.title.parent.name)
# Get the first p tag and its contents
print(soup.p)
# Get the value of the p tag's class attribute
print(soup.p['class'])
# Get the first a tag and its contents
print(soup.a)
'''soup.tag only returns the first of all matching tags'''
# Get all a tags and their contents
print(soup.find_all('a'))
# Get the tag containing the link1 element and its contents
print(soup.find(id='link1'))
# Get the text of the tag containing the link1 element
print(soup.find(id='link1').string)
# Get the link and the text of every a tag
for link in soup.find_all('a'):
    print('Link: ' + link.get('href') + ' text: ' + link.string)
# Get the p tag whose class value is story, with all its tags and contents
print(soup.find("p", class_="story"))
# Get all the text under the p tag whose class value is story
print(soup.find("p", class_="story").get_text())
# Get tags whose names start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# Get all a tags whose href contains a given substring, e.g.:
print(soup.find_all('a', href=re.compile("ie")))
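Besides searching, BeautifulSoup can also navigate from a found tag to its parent and siblings. A small sketch (the one-line document here is made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = ('<p class="story"><a id="link1">Elsie</a>, '
            '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(id="link1")
# Walk up to the enclosing tag
print(first.parent.name)                    # p
# Walk sideways to the next a tag at the same level
print(first.find_next_sibling("a").string)  # Lacie
# Collect the text of every a under the p tag
print([a.string for a in soup.p.find_all("a")])
```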
Comprehensive example: scraping Wikipedia entries
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

resp = urlopen("").read().decode('utf-8')
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Find links that start with /wiki/
listurls = soup.find_all("a", href=re.compile("^/wiki/"))
# Print the name and URL of every entry
for url in listurls:
    # Filter out links ending in .jpg or .JPG
    if not re.search(r"\.(jpg|JPG)$", url['href']):
        # Print the link text and the corresponding link
        print(url.get_text() + ' ' + url['href'])
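The target URL is omitted in this copy, but the filtering logic can be checked offline against a small inline snippet (a sketch; the hrefs below are made up to mimic Wikipedia's link patterns):

```python
from bs4 import BeautifulSoup
import re

# Inline HTML standing in for the downloaded page
resp = """
<a href="/wiki/Python">Python</a>
<a href="/wiki/File:Logo.jpg">Logo</a>
<a href="/w/index.php">edit</a>
"""
soup = BeautifulSoup(resp, "html.parser")

entries = []
for url in soup.find_all("a", href=re.compile(r"^/wiki/")):
    if not re.search(r"\.(jpg|JPG)$", url["href"]):  # drop image links
        entries.append((url.get_text(), url["href"]))
print(entries)  # [('Python', '/wiki/Python')]
```

Only the first link survives: the second matches `^/wiki/` but ends in `.jpg`, and the third does not start with `/wiki/` at all.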