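This post covers a practical problem I ran into with jieba. A single service exposes two interfaces, a and b. Both use jieba for word segmentation, but each needs its own dictionary. By coincidence, the dictionaries contained the following entries:
Dictionary a: "乾拌麵"
Dictionary b: "乾拌", "麵"
At service startup, dictionary a was loaded first, and dictionary b, loaded afterwards, did not seem to take effect:
In interface a: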
jieba.load_userdict("a.txt")
In interface b:
jieba.load_userdict("b.txt")
As a result, interface b still did not split 乾拌麵 the way dictionary b specifies. Every time jieba is loaded, it prints info messages like these:
building prefix dict from the default dictionary ...
dumping model to file cache /var/folders/hv/kfb7n4lj06590hqxjv6f3dd00000gn/t/jieba.cache
loading model cost 0.824 seconds.
prefix dict has been built succesfully.
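The log shows the order of events: jieba first builds a prefix dict from the default dictionary, dumps the model to a file cache, and later loads read from that cache. More to the point, module-level functions such as jieba.load_userdict all operate on one shared default tokenizer, and load_userdict only adds entries to it. Once both interfaces have loaded their files, that single tokenizer holds the entries of a.txt and b.txt together, which is what causes the problem above.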
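To make the effect of a shared tokenizer concrete, here is a toy sketch. `ToyTokenizer` and its greedy longest-match `lcut` are hypothetical stand-ins, not jieba's implementation (jieba scores candidate segmentations by word frequency). The only property it borrows is the assumption that `load_userdict` adds entries to whatever vocabulary the tokenizer already holds.

```python
class ToyTokenizer:
    """Toy longest-match segmenter; NOT jieba's algorithm."""

    def __init__(self):
        self.words = set()

    def load_userdict(self, entries):
        # Assumption: loading a user dictionary only adds to the
        # existing vocabulary, it never replaces it.
        self.words.update(entries)

    def lcut(self, text):
        tokens = []
        i = 0
        while i < len(text):
            # Take the longest known word starting at position i,
            # falling back to a single character.
            for j in range(len(text), i, -1):
                if text[i:j] in self.words or j == i + 1:
                    tokens.append(text[i:j])
                    i = j
                    break
        return tokens


# One shared instance, standing in for a module-level default tokenizer.
shared = ToyTokenizer()
shared.load_userdict(["乾拌麵"])       # done by interface a
shared.load_userdict(["乾拌", "麵"])   # done by interface b
shared_result = shared.lcut("乾拌麵")

# A separate instance for interface b keeps its vocabulary isolated.
tok_b = ToyTokenizer()
tok_b.load_userdict(["乾拌", "麵"])
separate_result = tok_b.lcut("乾拌麵")

print(shared_result)
print(separate_result)
```

In the shared case the whole word from dictionary a is still present when interface b segments the text, so the longer match wins. With its own instance, interface b only ever sees its own entries.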
This is how I handled it:
In interface a:
jieba1 = jieba.Tokenizer(dictionary="a.txt")
In interface b:
jieba2 = jieba.Tokenizer(dictionary="b.txt")
An example session:
In [1]: import jieba
In [2]: jieba1 = jieba.Tokenizer(dictionary="a.txt")
In [3]: jieba2 = jieba.Tokenizer(dictionary="b.txt")
In [4]: jieba1.lcut("乾拌麵")
building prefix dict from /users/slade/desktop/a.txt ...
dumping model to file cache /var/folders/hv/kfb7n4lj06590hqxjv6f3dd00000gn/t/jieba.u5221c1b70f06b36e44bc519f39715c96.cache
loading model cost 0.006 seconds.
prefix dict has been built succesfully.
Out[4]: ['乾拌麵']
In [5]: jieba2.lcut("乾拌麵")
building prefix dict from /users/slade/desktop/b.txt ...
dumping model to file cache /var/folders/hv/kfb7n4lj06590hqxjv6f3dd00000gn/t/jieba.uc4f38d90bf7ce748744ff94fb2863fe4.cache
loading model cost 0.003 seconds.
prefix dict has been built succesfully.
Out[5]: ['乾拌', '麵']
One thing to note: the Tokenizer source contains the following routine for reading the dictionary:
def gen_pfdict(self, f):
    lfreq = {}
    ltotal = 0
    f_name = resolve_filename(f)
    for lineno, line in enumerate(f, 1):
        try:
            line = line.strip().decode('utf-8')
            word, freq = line.split(' ')[:2]
            freq = int(freq)
            lfreq[word] = freq
            ltotal += freq
            for ch in xrange(len(word)):
                wfrag = word[:ch + 1]
                if wfrag not in lfreq:
                    lfreq[wfrag] = 0
        except ValueError:
            raise ValueError(
                'invalid dictionary entry in %s at line %s: %s' % (f_name, lineno, line))
    f.close()
    return lfreq, ltotal
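With load_userdict, the word frequency in a dictionary file can be omitted. Here, however, the line word, freq = line.split(' ')[:2] means the frequency has to be written out. This may depend on the jieba version; I have not tested other versions.
a.txt:
乾拌麵 1
b.txt:
乾拌 1
麵 1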
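The frequency requirement can be checked without jieba using a small Python 3 re-sketch of the routine quoted above. `build_prefix_dict` and its `source_name` parameter are hypothetical names for illustration. It takes already-decoded text lines instead of a file object, and otherwise follows the same parsing steps as the quoted code.

```python
def build_prefix_dict(lines, source_name="<memory>"):
    """Python 3 re-sketch of the quoted gen_pfdict; returns (freqs, total)."""
    freqs = {}
    total = 0
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        try:
            # Unpacking needs at least two space-separated fields, so a
            # bare word without a frequency raises ValueError here.
            word, freq = line.split(' ')[:2]
            freq = int(freq)
        except ValueError:
            raise ValueError(
                'invalid dictionary entry in %s at line %s: %s'
                % (source_name, lineno, line))
        freqs[word] = freq
        total += freq
        # Register every prefix of the word with frequency 0 unless it is
        # already present, mirroring the inner loop of the quoted code.
        for end in range(1, len(word) + 1):
            freqs.setdefault(word[:end], 0)
    return freqs, total


# The contents of b.txt parse cleanly.
freqs_b, total_b = build_prefix_dict(["乾拌 1", "麵 1"], "b.txt")
print(freqs_b, total_b)

# An entry without a frequency is rejected.
try:
    build_prefix_dict(["乾拌麵"], "a.txt")
except ValueError as exc:
    print(exc)
```

The zero-frequency prefix entries let a lookup tell "this is the start of a longer word" apart from "this string is not in the dictionary at all".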