Handling out-of-vocabulary (OOV) words in a corpus

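A word-vector model trained with word2vec cannot solve the OOV problem: it has no vector for a word that never appeared in the training corpus. The code that follows works around this by filtering such words out before any vectors are looked up.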

import logging

import gensim

logger = logging.getLogger(__name__)

# Load the trained model.
wvmodel = gensim.models.Word2Vec.load(r'e:\pycharmproject_new\word2vectest\model\wiki_corpus.bin')

# Remove out-of-vocabulary words.
# document1 and document2 are assumed to be lists of tokens.
# Membership is checked against wvmodel.wv, the model's word vectors.
len_pre_oov1 = len(document1)
len_pre_oov2 = len(document2)
document1 = [token for token in document1 if token in wvmodel.wv]
document2 = [token for token in document2 if token in wvmodel.wv]
diff1 = len_pre_oov1 - len(document1)
diff2 = len_pre_oov2 - len(document2)
if diff1 > 0 or diff2 > 0:
    logger.info('removed %d and %d oov words from document 1 and 2 (respectively).', diff1, diff2)

In other words, the OOV words are simply removed before the distributed representation is ever looked up.
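To see the effect of this filtering without loading a model, here is a minimal sketch that uses a plain Python set in place of the model's vocabulary. The function name `remove_oov` and the toy vocabulary are assumptions made for illustration; they are not part of gensim.

```python
def remove_oov(tokens, vocabulary):
    """Return the in-vocabulary tokens and the number of tokens dropped."""
    kept = [token for token in tokens if token in vocabulary]
    return kept, len(tokens) - len(kept)


# Hypothetical toy vocabulary; a real one would come from the trained model.
toy_vocabulary = {"corpus", "word", "vector"}

kept, removed = remove_oov(["corpus", "zyxwv", "vector"], toy_vocabulary)
empty, removed_all = remove_oov(["zyxwv", "qqqq"], toy_vocabulary)
```

Note that a document made up entirely of OOV words comes back empty, so any later step that averages or compares vectors should probably check for that case.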

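fastText, by contrast, does not drop an OOV word: it represents the word with an approximate vector, composed from the vectors of the word's character n-grams.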
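The idea behind the fastText approach can be sketched in plain Python. This is a simplified illustration, not the fastText implementation: it assumes n-gram vectors are stored in an ordinary dict (fastText itself hashes n-grams into a fixed number of buckets), and the names `char_ngrams`, `oov_vector`, and `toy_ngram_vectors` are hypothetical. The n-gram lengths 3 to 6 follow fastText's defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known n-grams; zero vector if none are known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Toy 2-dimensional n-gram table (made-up values, for illustration only).
toy_ngram_vectors = {
    "<wo": [1.0, 0.0],
    "wor": [0.0, 1.0],
    "ord": [1.0, 1.0],
}

# "wordy" is not a known word, but it shares n-grams with the table above.
vector = oov_vector("wordy", toy_ngram_vectors, dim=2)
```

Because the unseen word shares n-grams with words seen in training, its vector lands near theirs, which is why no token has to be discarded.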

