Several ways to extract the domain name from a URL in Python

This article introduces several ways to extract the domain name from a URL in Python. It presents three methods; readers with the same need may find them useful.

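To find the domain name in a URL, the first idea that comes to mind is a regular expression, and the second is to look for a suitable library. Parsing with a regular expression alone is incomplete in several ways: the domain name is only one part of the URL, and the set of domain suffixes keeps growing. A Google search turns up a few approaches. One combines Python's built-in modules with a regular expression, and the other uses a ready-made third-party parsing module to extract the domain name directly.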

URLs to parse


urls = ["",
        "", "",
        "", "",
        "file:///d:/code/echarts-2.0.3/doc/example/tooltip.html",
        "", "",
        ""]

Using urlparse plus a regular expression


# Python 2
import re
from urlparse import urlparse

# Known suffixes. The list is maintained by hand and is necessarily incomplete.
tophostpostfix = (
    '.com','.la','.io','.co','.info','.net','.org','.me','.mobi',
    '.us','.biz','.***','.ca','.co.jp','.com.cn','.net.cn',
    '.org.cn','.mx','.tv','.ws','.ag','.com.ag','.net.ag',
    '.org.ag','.am','.asia','.at','.be','.com.br','.net.br',
    '.bz','.com.bz','.net.bz','.cc','.com.co','.net.co',
    '.nom.co','.de','.es','.com.es','.nom.es','.org.es',
    '.eu','.fm','.fr','.gs','.in','.co.in','.firm.in','.gen.in',
    '.ind.in','.net.in','.org.in','.it','.jobs','.jp','.ms',
    '.com.mx','.nl','.nu','.co.nz','.net.nz','.org.nz',
    '.se','.tc','.tk','.tw','.com.tw','.idv.tw','.org.tw',
    '.hk','.co.uk','.me.uk','.org.uk','.vg','.com.hk')

# One label followed by a known suffix, anchored at the end of the host.
# re.escape keeps every suffix literal inside the pattern.
regx = r'[^.]+(' + '|'.join(re.escape(h) for h in tophostpostfix) + r')$'
pattern = re.compile(regx, re.IGNORECASE)

print "--" * 40
for url in urls:
    parts = urlparse(url)
    host = parts.netloc              # may still include a port
    m = pattern.search(host)
    res = m.group() if m else host   # fall back to the whole host
    print "unknown" if not res else res

The output is:

meiwen.me
1000chi.com
see.xidian.edu.cn
python.org
google.com.hk
unknown
mongodb.org
python.org
127.0.0.1:8000

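The results are mostly acceptable. Note that see.xidian.edu.cn comes back unchanged because .edu.cn is not in the suffix list, and the port is kept for 127.0.0.1:8000 because the whole host is returned when nothing matches.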
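For readers on Python 3, the same idea can be sketched as follows. This is a minimal sketch rather than the article's original code: the suffix list is a shortened, assumed subset, the URLs are made-up examples, and extract_domain is a hypothetical helper name. It uses urllib.parse.urlparse and the hostname attribute, which drops the port.

```python
import re
from urllib.parse import urlparse

# Assumed, deliberately short suffix list for illustration only.
SUFFIXES = (".com", ".org", ".me", ".com.hk", ".co.uk")

# One label followed by a known suffix, anchored at the end of the host.
PATTERN = re.compile(
    r"[^.]+(?:" + "|".join(re.escape(s) for s in SUFFIXES) + r")$",
    re.IGNORECASE,
)


def extract_domain(url):
    """Return the matched domain, else the bare host, else "unknown"."""
    host = urlparse(url).hostname or ""  # hostname excludes any port
    match = PATTERN.search(host)
    result = match.group() if match else host
    return result or "unknown"


if __name__ == "__main__":
    # Hypothetical sample URLs, not the ones used in the article.
    for sample in ("http://www.example.com.hk/search?q=1",
                   "http://127.0.0.1:8000/",
                   "file:///tmp/page.html"):
        print(extract_domain(sample))
```

As with the original, any suffix missing from the list means the full host is returned, so the list has to be kept up to date by hand.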

Parsing the domain name with urllib


# Python 2
import urllib

print "--" * 40
for url in urls:
    proto, rest = urllib.splittype(url)   # split off the scheme
    res, rest = urllib.splithost(rest)    # split the host from the path
    print "unknown" if not res else res

The output is:

meiwen.me
1000chi.com
see.xidian.edu.cn
docs.python.org
www.google.com.hk
unknown
api.mongodb.org
pypi.python.org
127.0.0.1:8000

This returns the full host name, including prefixes such as www., so further parsing is still needed.
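One possible form of that further parsing is sketched below for Python 3, where the split helpers are deprecated and urllib.parse.urlsplit is the usual replacement. The function name host_without_www and the sample URLs in the comments are assumptions made for illustration. The sketch removes only the port and a leading www., so other subdomains such as api. are still kept.

```python
from urllib.parse import urlsplit


def host_without_www(url):
    """Return the host of url without its port or a leading 'www.'."""
    host = urlsplit(url).hostname or ""  # None for URLs such as file:///...
    if host.startswith("www."):
        host = host[len("www."):]
    return host or "unknown"


if __name__ == "__main__":
    # Hypothetical sample URLs, not the ones used in the article.
    print(host_without_www("http://www.example.com.hk/"))
    print(host_without_www("http://api.example.org:8080/v1"))
```

Reducing api.example.org to example.org needs knowledge of which suffixes are registrable, which is what the suffix list in the first method or a third-party module provides.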

Using the third-party module tld


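# Python 2
# Written against an older tld release, in which get_tld returned the
# registered domain; newer releases provide get_fld for that purpose.
from tld import get_tld

print "--" * 40
for url in urls:
    try:
        print get_tld(url)
    except Exception as e:
        # tld raises for URLs it cannot resolve to a known suffix
        print "unknown"

The output is:

meiwen.me
1000chi.com
xidian.edu.cn
python.org
google.com.hk
unknown
mongodb.org
python.org
unknown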

All of these results are acceptable.

Other parsing modules that can be used:

tld

tldextract

publicsuffix
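These modules build on the Public Suffix List rather than a hand-written regular expression. The Python 3 sketch below illustrates the basic lookup with the standard library only. It is a simplified illustration, not the code of any of these modules: SUFFIX_RULES is a tiny assumed excerpt, registered_domain is a hypothetical name, and the wildcard and exception rules of the real list are ignored.

```python
from urllib.parse import urlsplit

# Assumed, tiny excerpt of suffix rules. Real modules load the full
# Public Suffix List, including wildcard and exception rules ignored here.
SUFFIX_RULES = {"com", "org", "cn", "edu.cn", "hk", "com.hk", "uk", "co.uk"}


def registered_domain(url):
    """Return the longest known suffix plus one label, or None."""
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.split(".")
    # Try the longest candidate suffix first; i >= 1 always leaves
    # at least one label in front of the suffix.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SUFFIX_RULES:
            return ".".join(labels[i - 1:])
    return None


if __name__ == "__main__":
    # Hypothetical sample URLs, not the ones used in the article.
    for sample in ("http://see.example.edu.cn/",
                   "http://www.example.com.hk/search",
                   "http://127.0.0.1:8000/"):
        print(registered_domain(sample))
```

Because the lookup is driven by data instead of a pattern, supporting a new suffix only requires updating the rule set, which is the main advantage of the library-based approach.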
