I don't want to do interviews just for the sake of interviewing, so I'll let the internship search run its course. Spending whole days grinding easy problems is very inefficient, and I don't get through many anyway. Better to learn something useful:
Goal for September 8: understand the MMSEG algorithm and implement it in both Python and C++.
Overview of the MMSEG algorithm
Its key disambiguation rules, applied in order, are:
1. The chunk formed by matching up to 3 words should be as long as possible (maximum matching).
2. The average word length in the chunk should be as large as possible.
3. The word lengths should be as close to each other as possible (smallest variance).
4. The sum of log frequencies of the single-character words should be as large as possible.
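The four rules above reduce to a lexicographic comparison of a 4-tuple per candidate chunk. A minimal sketch of that scoring (the word list and frequency numbers here are made up purely for illustration):

```python
from math import log

# Hypothetical single-character frequencies (made-up numbers).
freqs = {"的": 10000, "研": 20, "究": 30}

def score(words):
    """Score a candidate chunk by MMSEG's four rules: total length,
    average word length, negated variance of word lengths, and
    summed log-frequency of single-character words."""
    lens = [len(w) for w in words]
    total = sum(lens)
    mean = total / len(words)
    var = sum((l - mean) ** 2 for l in lens) / len(words)
    degree = sum(log(freqs.get(w, 1)) for w in words if len(w) == 1)
    return (total, mean, -var, degree)

# Two candidate segmentations of the same four characters: both tie on
# total length and average length, so the negated variance decides,
# preferring the chunk with equal-length words.
a = ["研究", "生命"]   # 2 + 2 characters, variance 0
b = ["研究生", "命"]   # 3 + 1 characters, variance 1
best = max([a, b], key=score)   # -> ["研究", "生命"]
```

Tuples compare element by element, so rule 2 is only consulted on a rule-1 tie, rule 3 on a rule-2 tie, and so on, which matches how the rules are stated.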
Python implementation: word frequencies are stored in a plain dict, and the dictionary of words is a trie, built in the most straightforward way as a tree of nodes.
Following **, I barely got the program to run, but there are still parts I haven't fully understood:
# -*- coding: utf-8 -*-
import codecs
import sys
from math import log
from collections import defaultdict


class trie(object):
    class trienode(object):
        def __init__(self):
            self.val = 0
            self.trans = {}

    def __init__(self):
        self.root = trie.trienode()

    def add(self, word, value=1):
        curr_node = self.root
        for ch in word:
            if ch in curr_node.trans:
                curr_node = curr_node.trans[ch]
            else:
                curr_node.trans[ch] = trie.trienode()
                curr_node = curr_node.trans[ch]
        curr_node.val = value

    def __walk(self, trie_node, ch):
        if ch in trie_node.trans:
            trie_node = trie_node.trans[ch]
            return trie_node, trie_node.val
        else:
            return None, 0

    def match_all(self, word):
        """Return every dictionary word that is a prefix of `word`."""
        ret = []
        curr_node = self.root
        for ch in word:
            curr_node, val = self.__walk(curr_node, ch)
            if not curr_node:
                break
            if val:
                ret.append(val)
        return ret


class dict(trie):
    def __init__(self, fname):
        super(dict, self).__init__()
        self.load(fname)

    def load(self, fname):
        with codecs.open(fname, 'r', 'utf-8') as f:
            for line in f:
                word = line.strip()
                # store the word itself as the node value, so
                # match_all returns the matched words directly
                self.add(word, word)


class charfreqs(defaultdict):
    def __init__(self, fname):
        # default frequency is 1, so log() never sees 0
        super(charfreqs, self).__init__(lambda: 1)
        self.load(fname)

    def load(self, fname):
        with codecs.open(fname, 'r', 'utf-8') as f:
            for line in f:
                ch, freq = line.strip().split()
                self[ch] = freq


class mmseg(object):
    class chunk(object):
        """A candidate chunk: a group of up to 3 words."""
        def __init__(self, words, chs):
            self.words = words
            # length of each word
            self.lens = map(lambda x: len(x), words)
            # total number of characters in the chunk
            self.length = sum(self.lens)
            # average word length
            self.mean = float(self.length) / len(words)
            # variance of the word lengths
            self.var = sum(map(lambda x: (x - self.mean) ** 2, self.lens)) / len(words)
            # sum of log frequencies of the single-character words
            self.degree = sum([log(float(chs[x])) for x in words if len(x) == 1])

        def __str__(self):
            return ' '.join(self.words).encode('utf-8') + "(%f %f %f %f)" % \
                (self.length, self.mean, self.var, self.degree)

        def __lt__(self, other):
            # compare by the four MMSEG rules in order; the variance
            # is negated because a smaller variance ranks higher
            return (self.length, self.mean, -self.var, self.degree) < \
                (other.length, other.mean, -other.var, other.degree)

    def __init__(self, dic, chs):
        self.dic = dic
        self.chs = chs

    def __get_chunks(self, s, depth=3):
        ret = []

        def __get_chunks_it(s, num, segs):
            if (num == 0 or not s) and segs:
                ret.append(mmseg.chunk(segs, self.chs))
            else:
                m = self.dic.match_all(s)
                if not m:
                    # no dictionary match: fall back to a single character
                    __get_chunks_it(s[1:], num - 1, segs + [s[0]])
                for w in m:
                    __get_chunks_it(s[len(w):], num - 1, segs + [w])

        __get_chunks_it(s, depth, [])
        return ret

    def segment(self, s):
        while s:
            chunks = self.__get_chunks(s)
            best = max(chunks)
            # keep only the first word of the best chunk, then rescan
            yield best.words[0]
            s = s[len(best.words[0]):]


def test():
    words = ['字段', '找到', '知道']
    lens = map(lambda x: len(x), words)
    print lens
    length = sum(lens)
    print length


def print_root_ch(root):
    for k in root.trans.keys():
        print k,
    print ''


def check_exist(root, c):
    if c in root.trans:
        print u'exists: ' + c


def debug_helper(dic):
    print_root_ch(dic.root)
    check_exist(dic.root, u'一')
    check_exist(dic.root, u'簡')
    root = dic.root.trans[u'簡']
    print_root_ch(root)
    check_exist(root, u'單')
    rt = root.trans[u'單']
    print rt.val


if __name__ == '__main__':
    chs = charfreqs('dict/data/chars.dic')
    dic = dict('dict/data/words.dic')
    seg = mmseg(dic, chs)
    # debug_helper(dic)
    # segment one sentence at a time:
    str1 = u"簡單的正向匹配"
    for w in seg.segment(str1):
        print w
    # test()
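The heart of the trie part of the listing is match_all: walking the trie one character at a time and collecting every dictionary word that is a prefix of the remaining input. A standalone Python 3 sketch of just that piece (the tiny word list is made up; this is not the author's exact class layout):

```python
class TrieNode:
    """One trie node; `word` is set only at nodes that end a word."""
    __slots__ = ("word", "children")

    def __init__(self):
        self.word = None
        self.children = {}

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def match_all(root, s):
    """Return every dictionary word that is a prefix of s."""
    out, node = [], root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            break
        if node.word:
            out.append(node.word)
    return out

trie = build_trie(["簡", "簡單", "正向", "匹配"])
print(match_all(trie, "簡單的正向匹配"))   # ['簡', '簡單']
```

This is exactly what feeds the chunk generator: at each position the segmenter branches on every prefix match (or a single character when there is none) up to a depth of 3 words.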