Information Extraction: Street Name Extraction

Extract the road (street) name from each address in a given corpus.

Data

向塘北大道西50公尺
天龍路與龍華路交叉口北50公尺
觀瀾大道490號附近
成都市錦江區海椒市街13號附7號
玉蘭西路
團結北路23號
湖塘鎮火炬北路12號
昆明市晉寧區莊蹺西路28
金水路合作路28-1號
長公大道浙江顯家門業閬中總**旁
安陽街道嶺下東路4號樓
萬頃沙珠江街珠江東路169號
**大街萬達廣場a座一層a17
梅亭路18號民生銀行旁

北京市四川西路

Output

向塘北大道西50公尺 -> 塘北大道
北京市四川西路 -> 四川西路

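1. Run POS tagging with an existing toolkit, then study the tagged data and summarize rules (generic method + rules).

2. Building on approach 1, accumulate domain words, use them to label some data, and then train a domain-specific NER model.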
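
As a rough illustration of the labeling step in approach 2, the sketch below turns a list of already known road names into character-level BIO tags, the kind of training data an NER model typically consumes. The function name `bio_label`, the `ROAD` tag and the longest-match-first policy are assumptions made for this example; the post itself does not specify how the labeling is done.

```python
def bio_label(text, road_names):
    """Tag every character of `text` as B-ROAD, I-ROAD or O.

    `road_names` is a hypothetical list of accumulated domain words
    (road names that were extracted earlier, e.g. by the rules).
    """
    tags = ["O"] * len(text)
    # Longer names first, so a long road name wins over a shorter overlapping one.
    for name in sorted(road_names, key=len, reverse=True):
        if not name:
            continue
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            # Only label spans that are still completely untagged.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B-ROAD"
                for i in range(start + 1, end):
                    tags[i] = "I-ROAD"
            start = text.find(name, start + 1)
    return list(zip(text, tags))


if __name__ == "__main__":
    for char, tag in bio_label("北京市四川西路", ["四川西路"]):
        print(char, tag)
```

Character-level tags are used here because Chinese addresses have no whitespace; a word-level scheme on top of the tokenizer output would work as well.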

Implementation of the first approach:

"""
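@desc:
en: the ****** code of road name extraction
cn: simple street extraction script
@author: peter
@mail: [email protected]
@date: 2021/2/24
@note:
This script may have problems, but since this is all the data available for now it is left as is. For reference only.
"""
import pkuseg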
import jieba
from jieba import posseg


def tokenize_pku(word):
    # With postag=True, pkuseg returns a list of (word, pos) tuples.
    return tokenizer.cut(word)


def tokenize_jieba(word):
    ret = posseg.cut(word)
    return [(pair.word, pair.flag) for pair in ret]


use_pku_tokenizer = True
tokenize = None
if use_pku_tokenizer:
    tokenizer = pkuseg.pkuseg(postag=True)
    tokenize = tokenize_pku
else:
    # jieba's accuracy is limited.
    tokenize = tokenize_jieba

data = """向塘北大道西50公尺
天龍路與龍華路交叉口北50公尺
觀瀾大道490號附近
成都市錦江區海椒市街13號附7號
玉蘭西路
團結北路23號
湖塘鎮火炬北路12號
昆明市晉寧區莊蹺西路28
金水路合作路28-1號
長公大道浙江顯家門業閬中總**旁
安陽街道嶺下東路4號樓
萬頃沙珠江街珠江東路169號
**大街萬達廣場a座一層a17
梅亭路18號民生銀行旁
北京市四川西路""".split("\n")
data = [line.strip() for line in data]
# data: one list of (word, pos) pairs per address
pos_cands = [tokenize(line) for line in data]
road_keywords = ["街", "大道", "路"]


def check(word):
    # Check whether the word looks like a street or road name.
    for w in road_keywords:
        if w in word:
            return True
    return False


def check_city(word):
    # Check whether the word is an administrative area (province, city, district, ...).
    keywords = ["省", "市", "區", "街道", "縣", "村", "鎮"]
    for key in keywords:
        if word.endswith(key):
            return True
    return False
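

def find_road(pos_cands, verbose=False):
    """
    Road name POS patterns:
    n+n
    v+ns
    ns+n
    ns+nsnsn
    j+nn+n
    Args:
        pos_cands: list, e.g. [("北京", "ns")]
    """
    res = []
    pre_idx = -1
    pre_pos = ""
    text = ""
    if verbose:
        print(pos_cands)
    for idx, (word, pos) in enumerate(pos_cands):
        # Filter out area words (province, city, district, ...).
        if pos == "ns" and check_city(word):
            continue
        # Rules summarized from the observed patterns.
        if pre_pos in ["v", "j", "n", "a", "ns"] and pos in ["ns", "n"]:
            if check(word):
                text += word
                # Assumption: the statement here was lost in the source;
                # the finished candidate is collected before the reset.
                res.append(text)
                text = ""
                pre_pos = ""
            else:
                text = word
                pre_pos = pos
            pre_idx = idx
        elif check(word) and pos in ["ns", "n"]:
            # Assumption: a standalone road word is collected as is.
            res.append(word)
            pre_idx = idx
        elif pos in ["v", "j", "n", "a"]:
            # print(word)
            text += word
            pre_idx = idx
            pre_pos = pos
        elif pos in ["ns", "n"]:
            # print(word)
            text += word
            pre_idx = idx
            pre_pos = pos
        else:
            pre_idx = idx
            pre_pos = ""
    if text:
        # Assumption: leftover text is kept as a candidate.
        res.append(text)
    # Keep only candidates that contain a road keyword.
    real_res = []
    for word in res:
        for key in road_keywords:
            if key in word:
                real_res.append(word)
                break
    return real_res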


for cand in pos_cands:
    print(find_road(cand))
# print(find_road(pos_cands[-4], True))
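
Since the rules are tuned on only a handful of addresses, a small check against known answers can help whenever a rule is changed. The sketch below is one possible way to do that, not part of the original script: `exact_match_accuracy` and `EXPECTED` are hypothetical names, and the two pairs are the ones listed in the output section above.

```python
def exact_match_accuracy(extract, expected):
    """Share of addresses for which `extract` returns exactly the expected roads.

    `extract` is any callable that maps an address string to a list of road
    names; `expected` maps address strings to the list that should come back.
    """
    if not expected:
        return 0.0
    hits = 0
    for address, roads in expected.items():
        if list(extract(address)) == list(roads):
            hits += 1
    return hits / len(expected)


# The two input -> output pairs given in the output section.
EXPECTED = {
    "向塘北大道西50公尺": ["塘北大道"],
    "北京市四川西路": ["四川西路"],
}

# With the script above loaded, the extractor could be wired in like this:
# exact_match_accuracy(lambda s: find_road(tokenize(s)), EXPECTED)
```

Exact match is deliberately strict: an address that yields an extra or truncated road name counts as a miss, which makes regressions in the rules easy to spot.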

A copy is also available at **.
