Sentiment classification example

title: Sentiment classification example

tags: nltk

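Lately I have been reading Natural Language Processing with Python, mainly to learn how to do sentiment classification on text. After picking up some basic Python and NLTK, I worked through the document classification example in Chapter 6. I am on Python 3.5 while the book uses Python 2.7, so some functions behave differently and I hit quite a few pitfalls while debugging. I am recording them here, one step at a time.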

import nltk
from nltk.corpus import movie_reviews
import random

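This is two lines more than the book's listing. Without the first line, several later calls fail: nltk.FreqDist, nltk.NaiveBayesClassifier.train, and nltk.classify.accuracy. The third line is needed for the same reason: random.shuffle. The book does mention both imports elsewhere.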

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

Load the corpus. documents is a list of (word list, category) tuples, one per review file.
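To make the (word list, category) shape of documents concrete, here is a small sketch that builds the same structure from a made-up, in-memory corpus. toy_corpus and its file ids are hypothetical stand-ins for movie_reviews, chosen only so the snippet runs without downloading any data.

```python
import random

# Hypothetical stand-in for movie_reviews: category -> {fileid: list of words}.
toy_corpus = {
    "neg": {"neg/a.txt": ["a", "dull", "film"], "neg/b.txt": ["boring", "plot"]},
    "pos": {"pos/c.txt": ["a", "great", "film"]},
}

# Same nested comprehension as in the post: one (word_list, category) pair per file.
documents = [(list(words), category)
             for category in toy_corpus
             for words in toy_corpus[category].values()]

random.seed(0)  # fixed seed only so the shuffled order is repeatable
random.shuffle(documents)

for words, category in documents:
    print(category, words)
```

Each entry pairs the full token list of one file with its label, which is why later steps unpack entries as (d, c).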

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [word for (word, freq) in all_words.most_common(2000)]

Build a list of the 2000 most frequent words in the whole corpus to use as features.

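The second line is a big pitfall. The book's original is word_features = all_words.keys()[:2000], but in Python 3.5 dict.keys() returns a view that cannot be sliced. Some people change it to word_features = list(all_words.keys())[:2000]. That does run, but the resulting word_features is not the 2000 most frequent words, because in NLTK 3 the keys are no longer sorted by frequency. Instead we can call most_common(2000) on the FreqDist to select the 2000 most frequent words. That raises a new problem, though: it returns a list of (word, frequency) tuples.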

We only need the words later on, not their counts, so [word for (word, freq) in all_words.most_common(2000)] gets us what we want.
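The difference between slicing the keys and calling most_common is easy to see on a tiny example. The sketch below uses collections.Counter instead of nltk.FreqDist so it runs without NLTK installed; in NLTK 3, FreqDist is a subclass of Counter, so most_common behaves the same way. The token list is made up for illustration.

```python
from collections import Counter

# Toy token stream; in the post this would be movie_reviews.words().
tokens = ["the", "film", "is", "good", "the", "plot", "good", "good"]
counts = Counter(w.lower() for w in tokens)

# Slicing the keys does not follow frequency: the order is insertion order
# on Python 3.7+ and arbitrary on 3.5.
first_keys = list(counts.keys())[:2]

# most_common returns (word, count) pairs sorted by descending count.
pairs = counts.most_common(2)

# Keep only the words and drop the counts.
top_words = [word for (word, freq) in pairs]

print(first_keys)
print(pairs)
print(top_words)
```

Here "good" occurs three times and "the" twice, so those two are the words most_common(2) selects, regardless of where they first appear in the text.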

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

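This defines a function that extracts features from a document: for each of the 2000 most frequent words, it records whether the document contains that word. print(document_features(movie_reviews.words('pos/cv957_8737.txt'))) gives a direct look at the function's output; it is not part of the classification itself. Note that print is used differently in Python 3.5 than in 2.7.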
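As a quick check of what the feature dictionary looks like, here is a minimal sketch on made-up data. Unlike the version in the post, it takes word_features as a second parameter instead of reading a global; that change, and the toy_features and toy_document names, are assumptions made only to keep the snippet self-contained.

```python
def document_features(document, word_features):
    """Return, for each feature word, whether the document contains it."""
    document_words = set(document)  # a set makes each membership test cheap
    return {'contains(%s)' % word: (word in document_words)
            for word in word_features}


# Hypothetical toy inputs standing in for the 2000-word list and one review.
toy_features = ["good", "bad", "plot"]
toy_document = ["the", "plot", "was", "good", "good"]

features = document_features(toy_document, toy_features)
print(features)
```

The values are booleans, not counts: "good" appears twice in the toy document but is still recorded only as present.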

featuresets = [(document_features(d), c) for (d,c) in documents]

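Extract features for every entry in documents. The comprehension unpacks each entry because of the (word list, category) format described above.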

train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

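Split the data into a training set and a test set, train the classifier, and check the result. The first 100 feature sets are held out for testing and the rest are used for training.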
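The slicing and the accuracy figure can be illustrated without NLTK. The sketch below is not the Naive Bayes classifier from the post: toy_predict is a hard-coded rule standing in for a trained classifier's prediction, accuracy is a hand-written helper that computes the fraction of correct predictions, and the feature sets are made up.

```python
# Hypothetical hand-made feature sets: ({feature: value}, label) pairs.
featuresets = [
    ({"contains(good)": True}, "pos"),
    ({"contains(good)": False}, "neg"),
    ({"contains(good)": True}, "neg"),   # the toy rule below gets this one wrong
    ({"contains(good)": False}, "neg"),
    ({"contains(good)": True}, "pos"),
]

n_test = 3  # the post holds out 100 items; 3 is enough for a toy list
train_set, test_set = featuresets[n_test:], featuresets[:n_test]


def toy_predict(features):
    """Hard-coded rule standing in for a trained classifier's prediction."""
    return "pos" if features["contains(good)"] else "neg"


def accuracy(predict, labelled):
    """Fraction of (features, label) pairs whose prediction matches the label."""
    correct = sum(1 for features, label in labelled if predict(features) == label)
    return correct / len(labelled)


print(len(train_set), len(test_set))
print(accuracy(toy_predict, test_set))
```

Because the test items are the first n_test entries, the list has to be shuffled beforehand (as documents is in the post) for the held-out set to contain both labels.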
