Named Entity Recognition (1): Rule-Based Named Entity Recognition

I. Named Entity Recognition

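First, let us look at the concept of named entity recognition. Named entity recognition (NER) generally studies named entities in 3 major categories (entity, time, and numeric) and 7 subcategories (person names, place names, organization names, times, dates, monetary values, and percentages). The goal is to identify these named entities in a corpus. There are three main approaches: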

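1) Rule-based named entity recognition: relies on a system of hand-crafted rules combined with a named entity dictionary. Each rule is assigned a weight, and an entity's type is decided by how well the entity matches the rules. In most cases the rules depend on a specific language, domain, and writing style, so they struggle to cover all linguistic phenomena.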

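2) Statistical named entity recognition: as with word segmentation, the currently popular statistical methods include hidden Markov models (HMM) and conditional random fields (CRF). The main idea is to treat NER as a sequence labeling problem and learn from a manually annotated corpus. This approach depends on having such a corpus.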

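3) Hybrid methods: natural language processing is not an entirely random process. Using statistical methods alone makes the state search space very large, so rule knowledge is needed to filter and prune it in advance. In many cases NER uses a hybrid method that combines rules and statistics.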
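To make the "sequence labeling" framing in item 2 concrete, here is a minimal sketch of what labeled training data can look like. It assumes the common BIO tagging scheme, which the text above does not name, and the tokens and the `DATE` label are hand-written for illustration rather than taken from a real corpus. `spans_to_bio` is a hypothetical helper.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into one BIO tag per token.

    tokens: list of token strings.
    spans:  list of (start, end, label) tuples, with `end` exclusive.
    """
    tags = ['O'] * len(tokens)            # 'O' = outside any entity
    for start, end, label in spans:
        tags[start] = 'B-' + label        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + label        # remaining tokens of the entity
    return tags


# Illustrative tokenization only; a real segmenter may split differently.
tokens = ['我', '要', '預定', '5月', '1號', '的', '房間']
spans = [(3, 5, 'DATE')]                  # '5月1號' is one date entity
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)
```

A statistical model such as an HMM or CRF would then be trained to predict the tag column from the token column.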

II. Rule-Based Named Entity Recognition

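Here we take date recognition as a detailed example. Suppose the requirement is to extract date information from a hotel reservation system. The data is unstructured, so dates appear in many different forms. Our goal is to identify the possible date information in each record and convert it to a unified output format.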

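First, jieba segments the text and tags each word with its part of speech. We then collect runs of consecutive time-related words, that is, words whose part-of-speech tag is 'm' (numeral) or 't' (time).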
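The collection step can be sketched on its own, without jieba installed. The `(word, flag)` pairs below are hand-written stand-ins for what `jieba.posseg` might return; the real segmentation and tags may differ, and `collect_time_candidates` is a hypothetical helper name.

```python
def collect_time_candidates(tagged_words):
    """Join consecutive words tagged 'm' (numeral) or 't' (time)."""
    candidates, current = [], ''
    for word, flag in tagged_words:
        if flag in ('m', 't'):
            current += word               # extend the current run
        elif current:
            candidates.append(current)    # a non-time word ends the run
            current = ''
    if current:                           # flush a run that reaches the end
        candidates.append(current)
    return candidates


# Assumed tagging, for illustration only.
tagged = [('我', 'r'), ('要', 'v'), ('預定', 'v'), ('5月', 't'), ('1號', 'm'),
          ('至', 'p'), ('五月', 't'), ('5號', 'm'), ('的', 'uj'), ('房間', 'n')]
print(collect_time_candidates(tagged))    # ['5月1號', '五月5號']
```

Note that a lone numeral such as a room count is also collected at this stage, which is why a validity check is still needed afterwards.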

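Next, we check the validity of each date candidate, and finally convert the extracted dates to a unified format. The full code is as follows: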

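# coding=utf-8
import datetime
import re

import jieba.posseg as psg
from dateutil import parser  # dateutil's date parser: converts a string to a datetime

# NOTE: the dict and list literals in this listing (util_cn_num, util_cn_unit,
# res, time_res, keydate) and the time_res.append() calls were lost in the
# source text. They are reconstructed here from how the code uses them.
util_cn_num = {
    '零': 0, '一': 1, '二': 2, '兩': 2, '三': 3, '四': 4,
    '五': 5, '六': 6, '七': 7, '八': 8, '九': 9,
    '0': 0, '1': 1, '2': 2, '3': 3, '4': 4,
    '5': 5, '6': 6, '7': 7, '8': 8, '9': 9,
}
util_cn_unit = {'十': 10, '百': 100, '千': 1000, '萬': 10000}


# Check whether an extracted date candidate is valid
def check_time_valid(word):
    # A bare number of six digits or fewer (e.g. a room count) is not a date
    m = re.match(r"\d+$", word)
    if m:
        if len(word) <= 6:
            return None
    # Drop digits that trail a day marker, e.g. '5號3' becomes '5日'
    word1 = re.sub(r'[號日]\d+$', '日', word)
    if word1 != word:
        return check_time_valid(word1)
    else:
        return word1


# Convert a Chinese or Arabic numeral string to an int
def cn2dig(src):
    if src == "":
        return None
    m = re.match(r"\d+", src)
    if m:
        return int(m.group(0))
    rsl = 0
    unit = 1
    # Read right to left: a unit character sets the multiplier for the next digit
    for item in src[::-1]:
        if item in util_cn_unit.keys():
            unit = util_cn_unit[item]
        elif item in util_cn_num.keys():
            num = util_cn_num[item]
            rsl += num * unit
        else:
            return None
    # A leading unit with no digit before it counts once, e.g. '十五' is 15
    if rsl < unit:
        rsl += unit
    return rsl


# Convert a year string digit by digit, e.g. '二零二零' becomes 2020
def year2dig(year):
    res = ''
    for item in year:
        if item in util_cn_num.keys():
            res = res + str(util_cn_num[item])
        else:
            res = res + item
    m = re.match(r"\d+", res)
    if m:
        if len(m.group(0)) == 2:
            # A two-digit year is placed in the current century
            return datetime.datetime.today().year // 100 * 100 + int(m.group(0))
        else:
            return int(m.group(0))
    else:
        return None


def parse_datetime(msg):
    if msg is None or len(msg) == 0:
        return None
    try:
        dt = parser.parse(msg)
        return dt.strftime('%Y-%m-%d %H:%M:%S')
    except Exception:
        # dateutil could not parse the string; fall back to a regex for Chinese dates
        m = re.match(
            r"([0-9零一二兩三四五六七八九十]+年)?([0-9一二兩三四五六七八九十]+月)?([0-9一二兩三四五六七八九十]+[號日])?([上中下午晚早]+)?([0-9零一二兩三四五六七八九十百]+[點:\.時])?([0-9零一二三四五六七八九十百]+分)?([0-9零一二三四五六七八九十百]+秒)?",
            msg)
        # m.group() or m.group(0) is the whole match; m.group(n) is the n-th
        # parenthesized group
        if m.group(0):
            # The last character of each value (the unit) is stripped below,
            # so the '00' defaults become '0'
            res = {
                "year": m.group(1),
                "month": m.group(2),
                "day": m.group(3),
                "hour": m.group(5) if m.group(5) is not None else '00',
                "minute": m.group(6) if m.group(6) is not None else '00',
                "second": m.group(7) if m.group(7) is not None else '00',
            }
            params = {}
            for name in res:
                if res[name] is not None and len(res[name]) != 0:
                    tmp = None
                    if name == 'year':
                        tmp = year2dig(res[name][:-1])
                    else:
                        tmp = cn2dig(res[name][:-1])
                    if tmp is not None:
                        params[name] = int(tmp)
            # replace() returns a new datetime with the given fields replaced;
            # the original is not modified
            target_date = datetime.datetime.today().replace(**params)
            is_pm = m.group(4)
            if is_pm is not None:
                if is_pm == '下午' or is_pm == '晚上' or is_pm == '中午':
                    hour = target_date.time().hour
                    if hour < 12:
                        target_date = target_date.replace(hour=hour + 12)
            return target_date.strftime('%Y-%m-%d %H:%M:%S')
        else:
            return None


def time_extract(text):
    time_res = []
    word = ''
    keydate = {'今天': 0, '明天': 1, '後天': 2}
    for k, v in psg.cut(text):  # k is the segmented word, v is its part-of-speech tag
        if k in keydate:
            if word != '':
                time_res.append(word)
            # Replace the relative day with an absolute date; str.format()
            # inserts the Chinese unit characters after strftime()
            day = datetime.datetime.today() + datetime.timedelta(days=keydate.get(k, 0))
            word = day.strftime('%Y{y}%m{m}%d{d}').format(y='年', m='月', d='日')
        elif word != '':
            if v in ['m', 't']:
                word = word + k
            else:
                time_res.append(word)
                word = ''
        elif v in ['m', 't']:
            word = k
    if word != '':
        time_res.append(word)
    result = list(filter(lambda x: x is not None, [check_time_valid(w) for w in time_res]))
    final_res = [parse_datetime(w) for w in result]
    return [x for x in final_res if x is not None]


if __name__ == '__main__':
    text1 = '我要預定明天下午3點到後天下午的6個房間。'
    text2 = '我要預定5月1號至五月5號的房間'
    time_deal = time_extract(text1)
    print(text1, ':', time_deal)
    time_deal2 = time_extract(text2)
    print(text2, ':', time_deal2)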

The final output (the dates depend on the day the script is run) is:

我要預定明天下午3點到後天下午的6個房間。 : ['2020-04-10 15:00:00', '2020-04-11 12:00:00']

我要預定5月1號至五月5號的房間 : ['2020-05-01 00:00:00', '2020-05-05 00:00:00']
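Because `time_extract` resolves relative words against `datetime.datetime.today()`, the output above changes with the run date. The sketch below isolates that step and pins the reference date to 9 April 2020, the run date implied by the first output line. `resolve_relative` and its `base` parameter are hypothetical names introduced for illustration, and the keyword offsets are an assumption about the `keydate` table.

```python
from datetime import datetime, timedelta

# Assumed offsets in days: today, tomorrow, the day after tomorrow.
KEY_DATE = {'今天': 0, '明天': 1, '後天': 2}


def resolve_relative(keyword, hour=0, is_pm=False, base=None):
    """Resolve a relative-day keyword against a reference date."""
    base = base or datetime.today()
    target = (base + timedelta(days=KEY_DATE[keyword])).replace(
        hour=hour, minute=0, second=0, microsecond=0)
    # An afternoon or evening marker shifts hours below 12 forward by 12.
    if is_pm and target.hour < 12:
        target = target.replace(hour=target.hour + 12)
    return target.strftime('%Y-%m-%d %H:%M:%S')


base = datetime(2020, 4, 9)
print(resolve_relative('明天', hour=3, is_pm=True, base=base))  # 2020-04-10 15:00:00
print(resolve_relative('後天', is_pm=True, base=base))          # 2020-04-11 12:00:00
```

The second call also shows why "後天下午" with no explicit hour comes out as 12:00:00: the hour defaults to 0, and the afternoon marker then adds 12.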
