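A General Workflow for Text Preprocessing

Text preprocessing generally consists of the following steps:

Tokenization (mainly an issue for Chinese; English tokenization is comparatively simple)

Stop-word removal (using Chinese and English stop-word lists)

Stemming and lemmatization (for English, which also requires case normalization)

Part-of-speech tagging

Text vectorization (bag-of-words model, tf-idf, distributed word-vector representations)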
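
As a rough illustration of how the first steps chain together, here is a minimal sketch that uses only the Python standard library. The tiny stop-word set and the punctuation-stripping rule are assumptions made up for this example, not a real stop-word list, and the sketch only handles whitespace-delimited English.

```python
from collections import Counter

# Hypothetical, hand-written stop-word set used only for this illustration.
STOP_WORDS = {"the", "is", "a", "an", "and", "of"}


def preprocess(text):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = [t.strip(".,!?;:") for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


def bag_of_words(tokens):
    """Count how often each remaining token occurs."""
    return Counter(tokens)


tokens = preprocess("The cat and the hat is a Cat.")
bow = bag_of_words(tokens)
print(tokens)
print(bow)
```

Lowercasing before the stop-word lookup is what lets "The" and "Cat" match their lowercase forms, which is the case-normalization issue mentioned for English.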

The following Python code implements the main steps of text preprocessing:

import numpy as np
import nltk
import jieba
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer


# Tokenization
def fenci_eng(text):
    return text.split()


def fenci_cn(text):
    # cut_all=False selects precise mode (the default).
    # Note: a custom dictionary can be loaded, e.g. a Sogou cell lexicon.
    return list(jieba.cut(text, cut_all=False))


# Dropping stop words
def get_eng_stopwords():
    # Function words and other very widely used words
    return nltk.corpus.stopwords.words('english')


def get_cn_stopwords():
    # Other options: Chinese stop-word list (cn_stopwords), HIT stop-word list (hit_stopwords.txt)
    filename = r"stopwords-master/scu_stopwords.txt"
    stopwords = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            # Assumed loop body (missing in the source): one stop word per line
            stopwords.append(line.strip())
    return stopwords
# Note: to merge lists without duplicate elements: a = list(set(b))


# Normalization: stemming and lemmatization
def stemming_eng(words_list):
    # Stemming
    stemmer = SnowballStemmer(language='english')
    return [stemmer.stem(word) for word in words_list]


def lemmatization_eng(words_list):
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words_list]


# Part-of-speech tagging
def words_tag_eng(words_list):
    return nltk.pos_tag(words_list, tagset='universal')


# Building the document-term matrix
# Bag-of-words (BoW)
# tf-idf
def tf_idf(documents_list):
    vectorizer = CountVectorizer(min_df=1)  # term counts; punctuation is not counted
    count = vectorizer.fit_transform(documents_list)
    # print(vectorizer.get_feature_names_out())  # vocabulary
    # print(vectorizer.vocabulary_)  # mapping from term to column index
    # print(count.toarray())  # term-count matrix
    transformer = TfidfTransformer()
    tfidf_matrix = transformer.fit_transform(count)  # compute tf-idf
    # print(tfidf_matrix.toarray())  # tf-idf matrix
    return tfidf_matrix.toarray()


def tf_idf_2(documents_list):
    tfidf_vec = TfidfVectorizer(min_df=1)
    tfidf_matrix = tfidf_vec.fit_transform(documents_list)  # compute tf-idf
    # print(tfidf_vec.get_feature_names_out())  # vocabulary
    # print(tfidf_vec.vocabulary_)  # mapping from term to column index
    return tfidf_matrix.toarray()


corpus = [
    'this is the first document',
    'this is the second second document.',
    'and the third one.',
    'is this the first document?',
]
print(tf_idf_2(corpus))
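
To make the tf-idf step less of a black box, the sketch below recomputes it by hand with the standard library. It assumes scikit-learn's default settings as I understand them: tokens are lowercase runs of two or more word characters, the idf is smoothed as ln((1 + n) / (1 + df)) + 1, and each row is L2-normalized. The function names `tokenize` and `tf_idf_manual` are made up for this example, and the result is meant as an illustration rather than a drop-in replacement for the vectorizers.

```python
import math
import re
from collections import Counter


def tokenize(doc):
    # Lowercase, then keep runs of at least two word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


def tf_idf_manual(documents):
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed idf (assumed to match scikit-learn's default smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        # L2-normalize so every non-empty document has unit length.
        rows.append([x / norm if norm else 0.0 for x in row])
    return vocab, idf, rows


corpus = [
    'this is the first document',
    'this is the second second document.',
    'and the third one.',
    'is this the first document?',
]
vocab, idf, matrix = tf_idf_manual(corpus)
print(vocab)
for row in matrix:
    print([round(x, 3) for x in row])
```

A term that appears in every document, such as "the" here, gets the minimum idf of 1, while rarer terms such as "second" are weighted more heavily.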
