That wall is a good fifty meters high!

I had only been using this for two days after writing it when the data source got blocked, lol.

Received a notice to suspend updates; the post will be taken down later. Hoping the official ban gets lifted.

It simply reads the page source and then matches it with regular expressions.
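As a rough sketch of that idea, here is the match-and-clean step run on a made-up page fragment instead of the real site. The two patterns are the ones the script uses; the sample HTML, the address `192.0.2.1`, and the helper name `extract` are assumptions for illustration only.

```python
import re

# Made-up page fragment for illustration; the real page layout is an assumption.
SAMPLE_PAGE = (
    "<p>updated 2015.3.14</p>\n"
    "<pre>#google hosts 2015 by 360kb.com<br />\n"
    "192.0.2.1&nbsp;example.com<br />\n"
    "#google hosts 2015 end</pre>\n"
)

# Same patterns as the script's regexhosts and regextimeupdated.
HOSTS_RE = r'#google hosts 2015 by 360kb.com.*#google hosts 2015 end'
DATE_RE = r'(\d\d\d\d\.\d\d?\.\d\d?)'


def extract(page):
    """Return (update date, cleaned hosts block) from the page source."""
    date = re.search(DATE_RE, page).group(1)
    # re.S lets "." cross line breaks, so the whole block is captured.
    block = re.search(HOSTS_RE, page, re.S).group()
    block = block.replace('&nbsp;', ' ')      # HTML space entity -> space
    block = re.sub(r'<[^>]+>', '', block)     # drop tags such as <br />
    return date, block


if __name__ == '__main__':
    date, hosts = extract(SAMPLE_PAGE)
    print(date)
    print(hosts)
```

Both `re.search` calls return `None` when nothing matches, so a real run would want a check before calling `.group()`.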

I'm just too lazy to update hosts by hand.

I've poured my heart and soul into getting Chrome sync to work.

Add it to your scheduled tasks alongside the earlier 湯站 crawler.

Procedural programming is full of love.

#encoding:utf-8
# Python 2 script (urllib.urlopen does not exist in Python 3).
import urllib
import re
import os

url = ''  # source page URL (blank in the original listing)
regexhosts = r'#google hosts 2015 by 360kb.com.*#google hosts 2015 end'
regextimeupdated = r'(\d\d\d\d\.\d\d?\.\d\d?)'

hostspath = 'c:\\windows\\system32\\drivers\\etc\\hosts'

def retrievepage(url):
    '''
    Read the page source.
    '''
    response = urllib.urlopen(url)
    page = response.read()
    return page

def matchtimeupdated(page):
    '''
    Match the hosts update time in the page source.
    '''
    timeupdated = re.search(regextimeupdated, page)
    return timeupdated.group(1)

def matchhostlist(page):
    '''
    Match the host list in the page source.
    '''
    # re.S makes "." match newlines so the multi-line block is captured.
    result = re.search(regexhosts, page, re.S)
    hosts = result.group()
    return hosts

def translatespaceentity(srcstring):
    '''
    Convert "&nbsp;" in the result to spaces.
    '''
    return srcstring.replace('&nbsp;', ' ')

def removehtmllabels(srcstring):
    '''
    Strip HTML tags from the result.
    '''
    return re.sub(r'<[^>]+>', '', srcstring)

def addextrainfo(srcstring, extrainfo):
    '''
    Add extra information as the first line.
    '''
    return extrainfo + '\n' + srcstring

def write2file(hosts, filepath):
    '''
    Write out to a file.
    '''
    f = open(filepath, 'w')
    f.write(hosts)
    f.close()

def run():
    '''
    Main entry function.
    '''
    page = retrievepage(url)

    roughhosts = matchhostlist(page)
    precisehosts = removehtmllabels(translatespaceentity(roughhosts))

    extrainfo = '''
#hosts updated at %s
#script written by mlxy@ feel free to modify and distribute it.
''' % matchtimeupdated(page)
    hostswithextra = addextrainfo(precisehosts, extrainfo)

    write2file(hostswithextra, hostspath)

if __name__ == '__main__':
    run()
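The listing relies on `urllib.urlopen`, which only exists in Python 2. For anyone on Python 3, a minimal sketch of the fetch step might look like the following. The `encoding` parameter and its UTF-8 default are assumptions; the source page's real encoding is not stated here.

```python
import urllib.request


def retrievepage(url, encoding='utf-8'):
    """Fetch a page and return its source as text (Python 3).

    The encoding default is an assumption, not taken from the original script.
    """
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    # Python 3 returns bytes, so decode before running str regexes on it.
    return raw.decode(encoding, errors='replace')


if __name__ == '__main__':
    # A data: URL keeps this demo offline.
    print(retrievepage('data:text/plain,hello'))
```

Decoding up front keeps the rest of the pipeline (the `re.search` calls and `str.replace`) working on text, much as it does in the Python 2 version.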

The regular expressions, on the other hand, are a real pain to write.
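Part of the pain is probably how quietly a pattern like this can go wrong. Here is a small sketch of two pitfalls, using made-up text and hypothetical `#begin`/`#end` markers: without `re.S` the dot stops at line breaks, and a greedy `.*` runs to the last end marker rather than the first.

```python
import re

# Made-up text with two marked blocks; the markers are hypothetical.
TEXT = "#begin\na\n#end\nnoise\n#begin\nb\n#end\n"

# Without re.S, "." does not match "\n", so a multi-line block is never found.
no_flag = re.search(r'#begin.*#end', TEXT)

# With re.S, greedy ".*" spans from the first #begin to the LAST #end.
greedy = re.search(r'#begin.*#end', TEXT, re.S).group()

# Non-greedy ".*?" stops at the first #end instead.
lazy = re.search(r'#begin.*?#end', TEXT, re.S).group()

if __name__ == '__main__':
    print(no_flag)
    print(repr(greedy))
    print(repr(lazy))
```

The script's greedy pattern is fine as long as the end marker appears only once on the page; if it ever appeared twice, the non-greedy form would be the safer choice.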

That wall is a good fifty meters high!
