A Quick Look at the Stanford Segmenter

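I recently started studying the principles of Chinese word segmentation, with the aim of learning natural language processing from the most basic level up. After more than ten years of research, Chinese word segmentation leaves little room for new results, but I still think it is the best place to build a foundation in NLP. From HMMs through maximum entropy (MaxEnt) models to CRFs, the research on Chinese word segmentation condenses the history of natural language processing.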

Using it is simple:

Step 2: Run SegDemo via Run As -> Run Configurations. The program needs one argument, the input file test.simp.utf8.

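Because stanford-segmenter uses a lot of memory, you need to set the VM arguments in the run configuration; otherwise the run fails with an out-of-memory error.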
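To check that the VM arguments actually took effect, a small stand-alone class can print the maximum heap the JVM was started with. This is only a sketch: the class name `HeapCheck`, the 1024 MB threshold, and the `-Xmx2g` value in the message are my own illustrative assumptions, since the memory the segmenter really needs depends on the model and dictionary being loaded.

```java
// Minimal sketch: report the maximum heap available to this JVM.
// ASSUMED_MIN_HEAP_MB is an illustrative guess, not a documented requirement.
class HeapCheck {
    static final long ASSUMED_MIN_HEAP_MB = 1024;

    // Maximum heap size in megabytes, as seen by the running JVM.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // True when the given heap size meets the given requirement.
    static boolean isEnough(long heapMb, long requiredMb) {
        return heapMb >= requiredMb;
    }

    public static void main(String[] args) {
        long heapMb = maxHeapMb();
        System.out.println("Max heap: " + heapMb + " MB");
        if (!isEnough(heapMb, ASSUMED_MIN_HEAP_MB)) {
            // -Xmx is the standard JVM option for the maximum heap size.
            System.out.println("Heap looks small; try a larger value such as -Xmx2g in VM arguments.");
        }
    }
}
```

Run it with the same VM arguments as SegDemo; if the printed value does not change when you edit the setting, the argument is probably in the program arguments box rather than the VM arguments box.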

Now for the moment of truth:

testfile=test.simp.utf8
serdictionary=data/dict-chris6.ser.gz
sighancorporadict=data
inputencoding=utf-8
sighanpostprocessing=true
loading classifier from d:\workspace_vancl\stanfordsegmenter\data\ctb.gz ...
loading chinese dictionaries from 1 files: data/dict-chris6.ser.gz
loading dictionaries from data/dict-chris6.ser.gz...done. unique words in chinesedictionary is: 423200
done [26.8 sec].
info: tagaffixdetector: usechpos=false | usectbchar2=true | usepkchar2=false
info: tagaffixdetector: building tagaffixdetector from data/dict/character_list and data/dict/in.ctb
loading character dictionary file from data/dict/character_list
loading affix dictionary from data/dict/in.ctb
面對新世紀，世界各國人民的共同願望是：繼續發展人類以往創造的一切文明成果，克服 20 世紀困擾著人類的戰爭和貧困問題，推進和平與發展的崇高事業，創造一個美好的世界。
crfclassifier tagged 80 words in 1 documents at 134.45 words per second.

From this output it is easy to guess that the source text being segmented is the file passed in as the argument, test.simp.utf8.
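The settings echoed at the top of the log look like ordinary Java properties, so the configuration can be sketched with `java.util.Properties` alone. Treat this as an assumption-laden illustration: the log prints the keys in lower case, and the camel-case spellings below (`sighanCorporaDict`, `serDictionary`, `testFile`, `inputEncoding`, `sighanPostProcessing`) are my guess at the original names. The class `SegmenterSettings` is hypothetical and deliberately stops short of calling the segmenter library.

```java
import java.util.Properties;

// Hypothetical helper: rebuilds the settings shown in the log output.
// Key spellings are assumed; the log only shows them lower-cased.
class SegmenterSettings {
    static Properties build(String dataDir, String testFile) {
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", dataDir);
        props.setProperty("serDictionary", dataDir + "/dict-chris6.ser.gz");
        props.setProperty("testFile", testFile);
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        return props;
    }

    public static void main(String[] args) {
        // Fall back to the file name used in this post when no argument is given.
        String testFile = args.length > 0 ? args[0] : "test.simp.utf8";
        Properties props = build("data", testFile);
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```

Seeing the settings spelled out this way makes the dependency clear: the only thing that changes from run to run is the test file, while the dictionary and corpus paths all hang off the `data` directory.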

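With the result in hand, you can attach the source code and look into the details of how the segmentation model is built. It is like riding a bicycle: ride it first to get an intuitive feel, and once you are interested, the rest comes easily!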

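In fact, in 《數學之美》 (The Beauty of Mathematics), CRFs are used for syntactic parsing, which is also a foundation of natural language processing. The well-known stanford-parser, however, uses not CRFs but probabilistic context-free grammars (PCFG).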
