The Kaggle IMDB movie-review sentiment classification problem: label each review as positive or negative.
The original solution uses word2vec to map words to vectors and then applies deep learning; here we instead use tf-idf features with the two simplest classifiers, Multinomial Naive Bayes and logistic regression.
import re                      # regular expressions
from bs4 import BeautifulSoup  # HTML tag handling
import pandas as pd

def review_to_wordlist(review):
    '''Turn one raw review into a list of lowercase words.'''
    # Strip the HTML tags and keep only the text content
    review_text = BeautifulSoup(review).get_text()
    # Use a regex to keep only the parts we want (ASCII letters)
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    # Lowercase everything and split into a word list
    words = review_text.lower().split()
    return words
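The tag-stripping and filtering above can also be sketched without bs4. This `clean_review` helper (not in the original) is an illustrative, dependency-free approximation of the same pipeline, using a regex where `review_to_wordlist` uses BeautifulSoup:

```python
import re

def clean_review(review):
    # Regex tag-stripping stands in for BeautifulSoup here; BeautifulSoup
    # is more robust against malformed HTML
    text = re.sub(r"<[^>]+>", " ", review)
    # Keep ASCII letters only, as in review_to_wordlist
    text = re.sub("[^a-zA-Z]", " ", text)
    return text.lower().split()

print(clean_review("<br />Great movie, 10/10!"))  # -> ['great', 'movie']
```

Note that the letter filter also discards digits and punctuation, so ratings like "10/10" vanish; that matches the original function's behavior.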
# Read the training and test tsv files with pandas
train = pd.read_csv('/Users/hanxiaoyang/imdb_sentiment_analysis_data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
test = pd.read_csv('/Users/hanxiaoyang/imdb_sentiment_analysis_data/testData.tsv', header=0, delimiter="\t", quoting=3)
# Pull out the sentiment labels: 1 for positive, 0 for negative
y_train = train['sentiment']
# Convert both the training and test reviews into cleaned word lists
train_data = []
for i in xrange(0, len(train['review'])):
    train_data.append(' '.join(review_to_wordlist(train['review'][i])))
test_data = []
for i in xrange(0, len(test['review'])):
    test_data.append(' '.join(review_to_wordlist(test['review'][i])))
y_train
0        1
1        1
2        0
3        0
4        1
         ..
24998    0
24999    1
Name: sentiment, dtype: int64
from sklearn.feature_extraction.text import TfidfVectorizer as TFIV
# Initialize the TFIV object: drop English stop words, add a bigram model
tfv = TFIV(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
           token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1,
           sublinear_tf=1, stop_words='english')
# Concatenate the training and test sets so tf-idf is fit on a shared vocabulary
x_all = train_data + test_data
len_train = len(train_data)
# This step is a bit slow -- go grab a cup of tea and browse a while...
tfv.fit(x_all)
x_all = tfv.transform(x_all)
# Split back into the training and test portions
x = x_all[:len_train]
x_test = x_all[len_train:]
# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB as MNB
model_nb = MNB()
model_nb.fit(x, y_train)  # feed the tf-idf features straight in
MNB(alpha=1.0, class_prior=None, fit_prior=True)
from sklearn.cross_validation import cross_val_score
import numpy as np
print "20-fold cross-validation AUC for Multinomial Naive Bayes: ", np.mean(cross_val_score(model_nb, x, y_train, cv=20, scoring='roc_auc'))
20-fold cross-validation AUC for Multinomial Naive Bayes:  0.950837239
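With `cv=20`, `cross_val_score` partitions the 25000 training rows into 20 folds, trains on 19 of them, and scores AUC on the held-out one, 20 times. A minimal sketch of how such contiguous folds are built (mirroring an unshuffled KFold; this helper is illustrative, not sklearn's implementation):

```python
def kfold_indices(n_samples, k):
    # Yield (train, test) index lists for k contiguous folds; the first
    # n_samples % k folds get one extra sample, as in unshuffled KFold
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(25000, 20))
print(len(folds), len(folds[0][1]))  # -> 20 1250
```

Every sample lands in exactly one test fold, so the 20 AUC scores are computed on disjoint held-out sets before being averaged.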
# Now give logistic regression a spin
from sklearn.linear_model import LogisticRegression as LR
from sklearn.grid_search import GridSearchCV
# Set the grid-search parameters (a single C value, matching the one result printed below)
grid_values = {'C': [30]}
# Score with roc_auc
model_lr = GridSearchCV(LR(penalty='l2', dual=True, random_state=0), grid_values, scoring='roc_auc', cv=20)
# Feed in the data
model_lr.fit(x, y_train)
# 20-fold cross-validation -- settle in for a long wait...
GridSearchCV(cv=20, estimator=LogisticRegression(C=1.0, class_weight=None, dual=True,
    fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=0, tol=0.0001),
    fit_params={}, iid=True, loss_func=None, n_jobs=1,
    param_grid={'C': [30]}, pre_dispatch='2*n_jobs', refit=True,
    score_func=None, scoring='roc_auc', verbose=0)
# Print the results
print model_lr.grid_scores_
[mean: 0.96459, std: 0.00489, params: {'C': 30}]
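Both `sklearn.cross_validation` and `sklearn.grid_search` were removed in scikit-learn 0.20; their replacements live in `sklearn.model_selection`, and `grid_scores_` became `cv_results_`/`best_score_`. A Python 3 sketch of the same grid search, where synthetic data stands in for the tf-idf matrix and the C grid is illustrative rather than the original:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tf-idf features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

model_lr = GridSearchCV(
    LogisticRegression(penalty='l2', solver='liblinear', random_state=0),
    {'C': [0.1, 1.0, 10.0]},  # illustrative grid, not the original values
    scoring='roc_auc',
    cv=5,                     # 5 folds to keep the toy run quick; the text uses 20
)
model_lr.fit(X, y)
print(model_lr.best_params_, round(model_lr.best_score_, 3))
```

The logistic regression edges out Naive Bayes here (0.965 vs. 0.951 AUC), which is typical on high-dimensional tf-idf features where the conditional-independence assumption of Naive Bayes is badly violated.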