Bayesian Methods and a Movie Review Classification Example

2021-09-20 15:32:12

The Kaggle IMDB movie-review sentiment (positive/negative) classification problem; the Kaggle address is

The original solution used word2vec to turn words into word vectors and then processed them with deep learning. Here we instead use tf-idf as features and try the simplest approaches: naive Bayes and logistic regression.

import re                      # regular expressions
from bs4 import BeautifulSoup  # HTML tag handling
import pandas as pd

def review_to_wordlist(review):
    '''Clean a raw review and split it into a list of lowercase words.'''
    # Strip the HTML tags and keep the text content
    review_text = BeautifulSoup(review).get_text()
    # Use a regular expression to keep only letters
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    # Lowercase all the words and split into a word list
    words = review_text.lower().split()
    # Return words
    return words
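If BeautifulSoup is not available, the tag-stripping step can be sketched with a plain regex; this self-contained Python 3 stand-in only approximates `get_text()` and is meant to show the cleaning pipeline, not replace it:

```python
import re

def review_to_wordlist(review):
    # Crude tag stripper -- a stand-in for BeautifulSoup(review).get_text()
    text = re.sub(r"<[^>]+>", " ", review)
    # Keep only letters, then lowercase and split into a word list
    text = re.sub("[^a-zA-Z]", " ", text)
    return text.lower().split()

print(review_to_wordlist("<br />An AMAZING film, 10/10!"))
# ['an', 'amazing', 'film']
```

Note that punctuation and digits are simply dropped, so ratings like "10/10" vanish from the word list.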

# Read the training and test CSV files with pandas
train = pd.read_csv('/users/hanxiaoyang/imdb_sentiment_analysis_data/labeledtraindata.tsv', header=0, delimiter="\t", quoting=3)
test = pd.read_csv('/users/hanxiaoyang/imdb_sentiment_analysis_data/testdata.tsv', header=0, delimiter="\t", quoting=3)

# Extract the sentiment labels: positive or negative
y_train = train['sentiment']

# Convert both the training and test reviews into word lists,
# joined back into strings for the vectorizer
train_data = []
for i in xrange(0, len(train['review'])):
    train_data.append(" ".join(review_to_wordlist(train['review'][i])))
test_data = []
for i in xrange(0, len(test['review'])):
    test_data.append(" ".join(review_to_wordlist(test['review'][i])))

y_train

0        1
1        1
2        0
3        0
4        1
...
24997    0
24998    0
24999    1
Name: sentiment, dtype: int64

from sklearn.feature_extraction.text import TfidfVectorizer as TFIV

# Initialize the TFIV object: remove English stop words and add a 2-gram language model
tfv = TFIV(min_df=3, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1, stop_words='english')

# Combine the training and test sets so tf-idf vectorization uses a single vocabulary
x_all = train_data + test_data
len_train = len(train_data)

# This step is a bit slow -- go grab a cup of tea while it runs...
tfv.fit(x_all)
x_all = tfv.transform(x_all)

# Split back into the training and test portions
x = x_all[:len_train]
x_test = x_all[len_train:]
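To see what the vectorizer is computing, here is a minimal pure-Python sketch of smoothed tf-idf on a hypothetical two-document corpus. The smoothing term matches what scikit-learn applies with smooth_idf=True; the sublinear tf and 2-gram options used above are omitted for brevity:

```python
import math

docs = [["great", "movie", "great"], ["terrible", "movie"]]
n = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)          # raw term frequency
    df = sum(term in d for d in docs)        # document frequency
    # Smoothed idf: log((1 + n) / (1 + df)) + 1
    idf = math.log((1 + n) / (1 + df)) + 1
    return tf * idf

print(round(tfidf("great", docs[0]), 3))   # 0.937 -- rare word, high weight
print(round(tfidf("movie", docs[0]), 3))   # 0.333 -- appears everywhere, low weight
```

The point of the idf factor is visible even on this toy corpus: "movie" occurs in every document, so its weight collapses toward the plain term frequency, while the class-discriminating word "great" is boosted.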

# Multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB as MNB

model_nb = MNB()
model_nb.fit(x, y_train)  # feed the feature data straight in
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

from sklearn.cross_validation import cross_val_score
import numpy as np

print "20-fold cross-validation score of the multinomial naive Bayes classifier: ", np.mean(cross_val_score(model_nb, x, y_train, cv=20, scoring='roc_auc'))

20-fold cross-validation score of the multinomial naive Bayes classifier:  0.950837239
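What MultinomialNB does under the hood can be sketched in a few lines: per-class word-count likelihoods with Laplace (alpha=1) smoothing, combined with the class prior in log space. The word counts below are hypothetical, chosen only to illustrate the mechanics:

```python
import math

# Hypothetical word-count features: [count("great"), count("terrible")]
X = [[2, 0], [0, 2], [1, 0]]
y = [1, 0, 1]
alpha = 1.0  # Laplace smoothing, matching MultinomialNB(alpha=1.0)

def log_posterior(x, cls):
    idx = [i for i, label in enumerate(y) if label == cls]
    prior = len(idx) / len(y)                                # P(class)
    totals = [sum(X[i][j] for i in idx) for j in range(len(x))]
    denom = sum(totals) + alpha * len(x)
    logp = math.log(prior)
    for j, count in enumerate(x):
        # Smoothed P(word_j | class), raised to the observed count
        logp += count * math.log((totals[j] + alpha) / denom)
    return logp

# A review containing "great" twice is judged positive (class 1)
print(log_posterior([2, 0], 1) > log_posterior([2, 0], 0))  # True
```

Comparing unnormalized log posteriors is enough for classification, so the shared evidence term P(x) never needs to be computed.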
# Now let's fiddle with logistic regression
from sklearn.linear_model import LogisticRegression as LR
from sklearn.grid_search import GridSearchCV

# Set the grid search parameters (search over the regularization strength C)
grid_values = {'C': [30]}

# Score with ROC AUC
model_lr = GridSearchCV(LR(penalty='l2', dual=True, random_state=0), grid_values, scoring='roc_auc', cv=20)

# Feed in the data -- 20-fold cross-validation, so settle in for a long wait...
model_lr.fit(x, y_train)

GridSearchCV(cv=20, estimator=LogisticRegression(C=1.0, class_weight=None, dual=True,
        fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=0, tol=0.0001),
        fit_params={}, iid=True, loss_func=None, n_jobs=1,
        param_grid={'C': [30]}, pre_dispatch='2*n_jobs', refit=True,
        score_func=None, scoring='roc_auc', verbose=0)

# Output the results
print model_lr.grid_scores_

[mean: 0.96459, std: 0.00489, params: {'C': 30}]
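The grid search itself is conceptually simple: score every candidate parameter value by cross-validation and keep the best. Here is a hypothetical minimal sketch; the scorer below is a dummy lambda standing in for a real 20-fold CV run:

```python
def grid_search(param_values, cv_score):
    # Score each candidate and return (best_score, best_value)
    return max((cv_score(v), v) for v in param_values)

# Dummy scorer for illustration; a real scorer would run k-fold
# cross-validation and return the mean ROC AUC
best_score, best_C = grid_search([0.1, 1, 30], lambda C: 1 - 1 / (10 * C))
print(best_C)  # 30
```

In practice one would search a wider range of C values (and possibly other hyperparameters), which is exactly why GridSearchCV with cv=20 takes so long: every candidate costs 20 full model fits.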
