AC Automaton, Part 1: A Trie for UTF-8 Encoded Text

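I recently needed to compute pinyin similarity between pieces of text. hankcs's HanLP stores its pinyin data with an AC automaton (Aho-Corasick), and I wanted to port that to Python. So, time to dig into AC automata.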

An AC automaton builds on two things: the trie and the KMP string-matching algorithm. I'll start with the trie.
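For reference, the KMP half of that foundation can be sketched in a few lines. This is a minimal illustration, not code from HanLP; the names `prefix_function` and `kmp_search` are my own. The point to notice is the fallback table: on a mismatch the matcher jumps back to the longest prefix that is still valid instead of restarting, which is the same idea an AC automaton later applies to trie nodes.

```python
def prefix_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back until the next symbol matches or nothing is left.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = prefix_function(pattern)
    matches = []
    k = 0  # number of pattern symbols currently matched
    for i, symbol in enumerate(text):
        while k > 0 and symbol != pattern[k]:
            k = fail[k - 1]
        if symbol == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches


print(prefix_function("abab"))
print(kmp_search("abababa", "aba"))
```

Both functions only compare symbols for equality, so they work the same way on `str` objects and on UTF-8 `bytes`.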

As for the concept of a trie, this article explains it very well, and throws in suffix trees as a bonus.

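What I want to do is map UTF-8 encoded Chinese words to their pinyin. UTF-8 encodes a typical Chinese character as 3 bytes, and each byte takes a value from 0x00 to 0xFF. Given that, I need a 256-way trie, i.e. every node can have up to 256 children, one per possible byte value.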
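A quick way to see this is to encode a couple of characters and look at the raw byte values. The helper name `utf8_bytes` below is just for illustration.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of ints in 0..255."""
    return list(text.encode("utf-8"))


for ch in ("中", "一", "a"):
    encoded = utf8_bytes(ch)
    # A common Chinese character takes 3 bytes, an ASCII letter takes 1.
    print(ch, len(encoded), [hex(b) for b in encoded])
```

Every value is a single byte, so a trie keyed on these values never needs more than 256 branches per node, regardless of how many distinct characters the dictionary contains.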

After reading through several implementations and pooling everyone's ideas, I wrote my own.

# coding: utf-8
# Written for Python 3.


class trienode(object):
    """One node of the 256-way trie; children are keyed by byte value."""

    def __init__(self):
        self.one_byte = {}    # byte value (0-255) -> child trienode
        self.value = None     # payload (here: the pinyin) if a word ends here
        self.is_word = False  # True if a stored word ends at this node


class trie256(object):
    def __init__(self):
        self.root = trienode()

    def getutf8string(self, string):
        # Encode the text as UTF-8; iterating the result yields ints in 0-255.
        bytes_array = bytearray(string.encode("utf-8"))
        return bytes_array

    def insert(self, bytes_array, value):
        # Walk down byte by byte, creating missing nodes along the way.
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                node.one_byte[byte] = trienode()
            node = node.one_byte[byte]
        node.is_word = True
        node.value = value

    def find(self, bytes_array):
        # Return the stored value, or None if the word is not in the trie.
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("this word is not in this trie.")
                return None
            node = node.one_byte[byte]
        if not node.is_word:
            # The path exists, but only as a prefix of longer words.
            print("it is not a word.")
            return None
        else:
            return node.value

    def modify(self, bytes_array, value):
        # Update the value of a word, inserting the word if it is missing.
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("this word is not in this trie, we will insert it.")
                node.one_byte[byte] = trienode()
            node = node.one_byte[byte]
        if not node.is_word:
            print("this word is not a word in this trie, we will make it a word.")
            node.is_word = True
            node.value = value
        else:
            print("modify this word...")
            node.value = value

    def delete(self, bytes_array):
        node = self.root
        path = []  # (parent, byte) pairs along the word, used for pruning
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("this word is not in this trie.")
                return
            path.append((node, byte))
            node = child
        if not node.is_word:
            print("it is not a word.")
            return
        node.is_word = False
        node.value = None
        # Prune nodes that no longer lead to any word, from the leaf upward.
        for parent, byte in reversed(path):
            child = parent.one_byte[byte]
            if child.one_byte or child.is_word:
                break
            del parent.one_byte[byte]

    def print_item(self, p, indent=0):
        # Print the byte values that branch off node p (one level only).
        if p:
            ind = '\t' * indent
            for key in p.one_byte.keys():
                label = "'%s' : " % key
                print(ind + label)
                # self.print_item(p.one_byte[key], indent + 1)


if __name__ == "__main__":
    trie = trie256()
    # Each line of the dictionary has the form word=pinyin.
    with open("dictionary/pinyin.txt", 'r', encoding="utf-8") as fd:
        for line in fd:
            if '=' not in line:
                continue  # skip blank or malformed lines
            line_split = line.split('=')
            word = line_split[0]
            pinyin = line_split[1].strip()
            word_bytes = trie.getutf8string(word)
            # Show the decimal value of each byte, prefixed with 'x'.
            sentence = ''
            for byte in word_bytes:
                sentence = sentence + 'x' + str(byte)
            print(sentence)
            trie.insert(word_bytes, pinyin)
    trie.print_item(trie.root)
    word_bytes = trie.getutf8string("一分鐘")
    for byte in word_bytes:
        print(byte)
    print(trie.find(word_bytes))
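To try the idea without the dictionary file, here is a condensed, self-contained sketch of the same byte-level trie. The names `ByteTrieNode`, `insert`, `find` and `SAMPLE_LINES` are made up for this example, and the two entries are hypothetical sample data in the `word=pinyin` shape the script above expects, not lines copied from HanLP's `pinyin.txt`.

```python
class ByteTrieNode:
    def __init__(self):
        self.children = {}    # byte value (0..255) -> ByteTrieNode
        self.value = None     # pinyin stored for the word ending here
        self.is_word = False  # True if a word ends at this node


def insert(root, word, value):
    node = root
    for byte in word.encode("utf-8"):
        # setdefault creates the child only when it does not exist yet.
        node = node.children.setdefault(byte, ByteTrieNode())
    node.is_word = True
    node.value = value


def find(root, word):
    node = root
    for byte in word.encode("utf-8"):
        node = node.children.get(byte)
        if node is None:
            return None  # path breaks off: word was never inserted
    return node.value if node.is_word else None


# Hypothetical sample entries (assumption: word=pinyin, one per line).
SAMPLE_LINES = ["一=yi1", "一分鐘=yi1,fen1,zhong1"]

root = ByteTrieNode()
for line in SAMPLE_LINES:
    word, pinyin = line.split("=", 1)
    insert(root, word, pinyin.strip())

print(find(root, "一分鐘"))  # a stored word
print(find(root, "一分"))    # only a prefix of a stored word
print(find(root, "二"))      # not in the trie at all
```

Both sample words start with the same three bytes, so they share a single path from the root, and the longer word simply continues below the node where the shorter one ends. The `is_word` flag is what separates a real entry from a mere prefix such as "一分".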

