Getting Started with Python: HTML Parsing

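This is code for scraping specific information from a web page, written by following Liao Xuefeng's tutorial. The code still does not feel concise enough and needs optimizing later.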

from html.parser import HTMLParser
from urllib.request import urlopen


class MyHTMLParser(HTMLParser):
    def __init__(self):
        # Keep entity references such as &ndash; separate so that
        # handle_entityref() is called (Python 3 decodes them by default).
        HTMLParser.__init__(self, convert_charrefs=False)
        self.h3 = None
        self.time = None
        self.span = None
        self.speci = False  # True once an &ndash; has been seen inside <time>
        self.date = ''

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if 'h3' == tag:
            for name, value in attrs:
                if 'event-title' == value:
                    self.h3 = tag
        if 'time' == tag:
            for name, value in attrs:
                if 'datetime' == name:
                    self.time = tag
                    self.date += value
        if 'span' == tag:
            for name, value in attrs:
                if 'event-location' == value:
                    self.span = tag

    def handle_endtag(self, tag):
        pass

    def handle_startendtag(self, tag, attrs):
        pass

    def handle_data(self, data):
        if 'h3' == self.h3:
            print('conference title:', data)
            self.h3 = None
        if 'time' == self.time:
            self.date += data
            if self.speci:
                # The text after the dash completes the date range.
                self.speci = False
                self.time = None
                print(self.date)
                self.date = ''
        if 'span' == self.span:
            print('event-location:', data)
            self.span = None
            print('\n')

    def handle_comment(self, data):
        pass

    def handle_entityref(self, name):
        if 'ndash' == name:
            self.date += '-'
            self.speci = True

    def handle_charref(self, name):
        pass


parser = MyHTMLParser()
url = ""  # the URL is left empty in the source
response = urlopen(url)
page = response.read().decode('utf-8')  # feed() expects str, not bytes
parser.feed(page)

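Note: the class has a self.date variable that I initialize to an empty string. While writing the handler methods I accidentally assigned None to it, and after that every attempt to concatenate a string onto it with '+' raised an error.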
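The failure described in this note can be reproduced in a few lines. This is a minimal sketch rather than the original code: the helper name `append_text` is hypothetical, and the sample strings are made up. It shows that concatenating a string onto `None` raises a `TypeError`, and one possible way to guard against an accidental reset.

```python
def append_text(buffer, text):
    """Append text to buffer, treating None as an empty string."""
    # Hypothetical helper: guards against a buffer accidentally set to None.
    return (buffer or '') + text


date = None  # the accidental reset described in the note
try:
    date += '28 May'  # None + str is not defined, so this raises TypeError
except TypeError as exc:
    error = str(exc)

# With the guard, concatenation works even though date started as None.
date = append_text(date, '28 May')
date = append_text(date, ' - 06 June')
print(date)
```

Resetting the variable to '' instead of None avoids the problem without any helper; the guard is only useful when None can arrive from elsewhere.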

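The above is code for scraping specific content from a web page. Because my knowledge is still limited, the handling of the special character '&ndash;' in the code does not feel ideal and needs to be improved later.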
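One possible improvement, assuming Python 3.5 or later: `HTMLParser` accepts a `convert_charrefs` argument, and when it is true, entities such as `&ndash;` are decoded before `handle_data()` is called, so no special flag is needed. The sketch below is not the original code. The class name `TimeTextParser` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TimeTextParser(HTMLParser):
    """Collect the text inside each <time> element (hypothetical class)."""

    def __init__(self):
        # With convert_charrefs=True, &ndash; reaches handle_data()
        # already decoded as the en dash character.
        super().__init__(convert_charrefs=True)
        self.in_time = False
        self.parts = []
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag == 'time':
            self.in_time = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == 'time' and self.in_time:
            # Join the collected pieces and collapse runs of whitespace.
            self.dates.append(' '.join(''.join(self.parts).split()))
            self.in_time = False

    def handle_data(self, data):
        if self.in_time:
            self.parts.append(data)


# Made-up sample markup, not taken from the page the script fetches.
sample = ('<p><time datetime="2016-05-28">28 May &ndash; 06 June '
          '<span> 2016</span></time></p>')
parser = TimeTextParser()
parser.feed(sample)
parser.close()
print(parser.dates)
```

Collecting text until the closing `</time>` tag also keeps the year, which sits in a nested `<span>` in this sample.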

