Keyword Matching with a DFA Algorithm

ahocorasick

esmre

In practice, however, these packages are all implemented on top of a DFA.

The source code is given below:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-


    class Node(object):
        """One trie node; each node doubles as a DFA state."""

        def __init__(self):
            self.children = None  # dict: character -> Node, created lazily
            self.flag = False     # True if a keyword ends at this node


    # word is a text string (this source file is UTF-8 encoded)
    def add_word(root, word):
        """Insert one keyword into the trie rooted at root."""
        if len(word) <= 0:
            return
        node = root
        for ch in word:
            if node.children is None:
                node.children = {}
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.flag = True


    def init(word_list):
        """Build the trie from a list of keywords."""
        root = Node()
        for line in word_list:
            add_word(root, line)
        return root


    # word and message are both text strings
    def key_contain(message, root):
        """Return the set of keywords that occur anywhere in message."""
        res = set()
        for i in range(len(message)):
            p = root
            j = i
            while (j < len(message) and p.children is not None
                   and message[j] in p.children):
                p = p.children[message[j]]
                j = j + 1
                # Record every keyword ending here, not only the longest one.
                if p.flag:
                    res.add(message[i:j])
        return res


    def dfa():
        print('----------------dfa-----------')
        word_list = ['hello', '民警', '朋友', '女兒', '派出所', '派出所民警']
        root = init(word_list)
        message = '四處亂咬亂吠，嚇得家中11歲的女兒躲在屋裡不敢出來，直到轄區派出所民警趕到後，才將孩子從屋中救出。最後在徵得主人同意後，民警和村民合力將這只發瘋的狗打死'
        x = key_contain(message, root)
        for item in x:
            print(item)


    if __name__ == '__main__':
        dfa()
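The code above restarts from the trie root at every position of the message. For comparison, here is a minimal sketch of the classic Aho-Corasick construction (the algorithm the ahocorasick package is named after), which adds failure links so the message is scanned in a single pass. It is an illustration only, not the code of either package; the names `build_automaton` and `find_keywords` are hypothetical.

```python
from collections import deque


def build_automaton(words):
    """Build goto, fail, and output tables for the given keywords."""
    goto = [{}]    # goto[state][char] -> next state; state 0 is the root
    fail = [0]     # fail[state] -> state of the longest proper suffix
    out = [set()]  # out[state] -> keywords that end at this state
    for word in words:
        if not word:
            continue
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: a state's failure link is derived from its parent's.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_keywords(message, automaton):
    """Scan message once and return every keyword it contains."""
    goto, fail, out = automaton
    found = set()
    state = 0
    for ch in message:
        # On a mismatch, fall back along failure links instead of restarting.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


automaton = build_automaton(['民警', '派出所', '派出所民警'])
print(find_keywords('直到轄區派出所民警趕到後', automaton))
```

Building the tables costs time proportional to the total keyword length, so this trade-off tends to pay off when the word list is large and many messages are scanned against it.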

Please also read this article of mine

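Fuzzy Keyword Matching with Hive LIKE

select a.code,a.region code,a.name from hangzhou a companyname b where a.name like b.key. In other systems a query like this can match directly against the field you want, but it does not work in Hive because the pattern gets escaped, so a custom UDF is needed to do the job! selec...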

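Keyword Extraction with the TF-IDF Algorithm

The TF-IDF algorithm is defined as $tfidf_{i,j} = tf_{i,j} \times idf_i$, where $tfidf_{i,j}$ is the importance of term i relative to document j. $tf_{i,j}$ is the share of occurrences that a given term accounts for in the given document, that is, the frequency of the term in that document. This number normalizes the term count so that it is not biased toward long documents. The formula is tf ...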
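The definition above can be sketched in Python. This is a minimal sketch under two assumptions: documents are already tokenized into lists of terms, and idf takes the common form log(N / df_i), since the excerpt is cut off before it gives the idf formula. The function names `tf`, `idf`, and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Share of the tokens in doc that equal term (normalized term count)."""
    return Counter(doc)[term] / len(doc) if doc else 0.0


def idf(term, docs):
    """Assumed form log(N / df): N documents, df of them contain term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def tfidf(term, doc, docs):
    """Importance of term i in document j: tf_{i,j} * idf_i."""
    return tf(term, doc) * idf(term, docs)


docs = [
    ['dfa', 'keyword', 'match', 'keyword'],
    ['hive', 'keyword', 'like'],
    ['tfidf', 'extract'],
]
print(tfidf('dfa', docs[0], docs))
print(tfidf('keyword', docs[0], docs))
```

Dividing by the document length is what keeps a long document from scoring higher merely because it contains more tokens.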

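Keyword Extraction Algorithms

The traditional TF-IDF algorithm considers only two statistics about a term: how often it occurs and how many documents it occurs in. It therefore makes little use of the information in the text. The definition of the algorithm is fixed, but reshaping and adapting it to our application scenario, so that it better fits that environment, will undoubtedly guide us more reliably toward the results we want. The textrank algo...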
