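Low-resource language NLP: text preprocessing and data cleaning

Continuing with the Big Data Lab recruitment problems~

(using Roman Urdu as the example low-resource language)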

link:

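The data used in this practice competition is a public dataset named "roman urdu dataset".

All of it is text data. Each text in the original data carries one of three sentiment labels: positive, negative, or neutral.

The practice competition removes the samples labeled neutral, so every sample is labeled either positive or negative. The task is classification: use a trained model to classify the sentiment of each text in the test set as "negative" or "positive".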

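The author of that solution has already done the job very well; all I could do was follow his steps once more and take notes along the way.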

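When I first got the data, I found that it contained a lot of symbols, so it could not be tokenized directly.

Before dealing with that, check whether the data has any missing values.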
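A minimal sketch of that missing-value check, assuming the data is loaded into a pandas DataFrame. The tiny `df_train` below and its column names (`id`, `text`, `label`) are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the training data; the column names are assumptions.
df_train = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "text": ["acha hai", None, "bura-din", "theek hai"],
    "label": ["positive", "negative", "negative", "positive"],
})

# Count the missing values in each column.
missing_per_column = df_train.isnull().sum()
print(missing_per_column)

# One possible policy: drop rows whose text is missing, then renumber the index.
df_checked = df_train.dropna(subset=["text"]).reset_index(drop=True)
print(len(df_train), "->", len(df_checked))
```

Whether to drop or fill the missing rows depends on how many there are; dropping is just the simplest choice for a sketch.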

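After that, convert the data to matrix form with the DataFrame's to_numpy() method (older pandas versions called this as_matrix(), which was removed in pandas 1.0).

numpy_array = df_train.to_numpy()
numpy_array_test = df_test.to_numpy()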
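To make the matrix form concrete, here is a small sketch of the conversion and of the column slicing used later. The sample frame is hypothetical; the assumption that column 1 holds the text and column 2 the label is taken from the slicing in the cleaning code, not from the dataset's documentation.

```python
import pandas as pd

# Hypothetical two-row frame; the real column layout is an assumption.
df_train = pd.DataFrame({
    "id": [0, 1],
    "text": ["acha hai", "bura-din"],
    "label": ["positive", "negative"],
})

# Convert the whole frame to a 2-D NumPy array of shape (n_rows, n_columns).
numpy_array = df_train.to_numpy()

# Slice out every row of column 1 (text) and column 2 (label).
x_train = numpy_array[:, 1]
y_train = numpy_array[:, 2]

print(numpy_array.shape)
print(x_train, y_train)
```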

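Because of the symbols, I used a fairly standard cleaning function to strip or replace the symbols and punctuation in the data.

The work is mainly done by the sub function of the re library; for simple literal replacements, the built-in str.replace method also works.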

re.sub(pattern, repl, string, count=0, flags=0)

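Parameters:

pattern: the regular expression pattern string;

repl: the replacement (either a string or a function that receives the match object and returns a string);

string: the string to be processed, in which the matches are replaced;

count: the maximum number of matches to replace; the default, 0, replaces all of them;

flags: optional flags that change how the pattern is matched, such as re.IGNORECASE or re.MULTILINE; the default, 0, applies none.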
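The parameters above can be illustrated with a short sketch. The sample strings are made up; only standard `re` calls are used.

```python
import re

text = "Acha-hai, BOHT acha!!"

# pattern + repl: delete every comma and exclamation mark.
no_punct = re.sub(r"[,!]", "", text)

# count: replace only the first two matches.
first_two = re.sub(r"a", "_", "banana", count=2)

# flags: match "acha" regardless of case.
ignore_case = re.sub(r"acha", "theek", text, flags=re.IGNORECASE)

# repl as a function: it receives the match object and returns the replacement.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 din 10 raat")

print(no_punct)
print(first_two)
print(ignore_case)
print(doubled)
```

Passing `count` and `flags` by keyword avoids mixing them up, since both are integers.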

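str.replace(old, new[, count])

Parameters:

old: the substring to be replaced.

new: the new string that replaces the old substring.

count: optional integer; at most count occurrences are replaced.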
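A brief sketch contrasting `str.replace` with `re.sub`, using made-up strings. The point is that `str.replace` treats its arguments literally, while `re.sub` interprets the pattern as a regular expression.

```python
import re

s = "a-b-c-d"

# Replace every hyphen with a space.
all_replaced = s.replace("-", " ")

# Replace at most two occurrences.
first_two = s.replace("-", " ", 2)

# str.replace is literal: "." means a dot and nothing else.
literal_dot = "a.b.c".replace(".", "")

# In a regex, an unescaped "." matches any character, so everything is removed.
regex_dot = re.sub(r".", "", "a.b.c")

# Escaping the dot restores the literal meaning.
escaped_dot = re.sub(r"\.", "", "a.b.c")

print(all_replaced, "|", first_two, "|", literal_dot, "|", repr(regex_dot), "|", escaped_dot)
```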

import re

# NLP cleaning function: strip or replace symbols in a single word, then lowercase it
def cleaner(word):
    word = re.sub(r'\#\.', '', word)
    word = re.sub(r'\n', '', word)
    word = re.sub(r',', '', word)
    word = re.sub(r'\-', ' ', word)
    word = re.sub(r'\.', '', word)
    word = re.sub(r'\\', ' ', word)
    # no backslash is left after the previous line, so this pattern never matches
    word = re.sub(r'\\x\.+', '', word)
    word = re.sub(r'\d', '', word)
    word = re.sub(r'^_.', '', word)
    word = re.sub(r'_', ' ', word)
    word = re.sub(r'^ ', '', word)
    word = re.sub(r' $', '', word)
    word = re.sub(r'\?', '', word)
    return word.lower()

# Rewrite letter patterns so that spelling variants of a word map to one form
# (defined here but not called in the steps below)
def hashing(word):
    word = re.sub(r'ain$', r'ein', word)
    word = re.sub(r'ai', r'ae', word)
    word = re.sub(r'ay$', r'e', word)
    word = re.sub(r'ey$', r'e', word)
    word = re.sub(r'ie$', r'y', word)
    word = re.sub(r'^es', r'is', word)
    word = re.sub(r'a+', r'a', word)
    word = re.sub(r'j+', r'j', word)
    word = re.sub(r'd+', r'd', word)
    word = re.sub(r'u', r'o', word)
    word = re.sub(r'o+', r'o', word)
    word = re.sub(r'ee+', r'i', word)
    if not re.match(r'ar', word):
        word = re.sub(r'ar', r'r', word)
    word = re.sub(r'iy+', r'i', word)
    word = re.sub(r'ih+', r'eh', word)
    word = re.sub(r's+', r's', word)
    # search the variable word, not the literal string 'word'
    if re.search(r'[rst]y', word) and word[-1] != 'y':
        word = re.sub(r'y', r'i', word)
    if re.search(r'[bcdefghijklmnopqrtuvwxyz]i', word):
        word = re.sub(r'i$', r'y', word)
    if re.search(r'[acefghijlmnoqrstuvwxyz]h', word):
        word = re.sub(r'h', '', word)
    word = re.sub(r'k', r'q', word)
    return word

# Clean every sentence in an array, word by word
def array_cleaner(array):
    x = []
    for sentence in array:
        clean_sentence = ''
        words = sentence.split(' ')
        for word in words:
            clean_sentence = clean_sentence + ' ' + cleaner(word)
        x.append(clean_sentence)
    return x

x_test = numpy_array_test[:, 1]
x_train = numpy_array[:, 1]
# clean x here
x_train = array_cleaner(x_train)
x_test = array_cleaner(x_test)
y_train = numpy_array[:, 2]

Inspecting the result after processing

The symbols targeted by the cleaning function have been removed from the data.
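Rather than inspecting by eye, the check can be automated. The sketch below is one possible approach, not part of the original solution: the helper `find_residue` and the rule that only lowercase ASCII letters and spaces count as clean are both assumptions, and the sample sentences are made up.

```python
import re

# Assumption: anything other than lowercase ASCII letters and spaces is residue.
RESIDUE = re.compile(r"[^a-z ]")

def find_residue(sentences):
    """Return (index, sentence, leftover characters) for each sentence with residue."""
    report = []
    for i, sentence in enumerate(sentences):
        leftovers = sorted(set(RESIDUE.findall(sentence)))
        if leftovers:
            report.append((i, sentence, leftovers))
    return report

# Hypothetical cleaned output; two sentences still contain punctuation.
sample = [" acha hai", " boht bura!", " theek hai :)"]
for index, sentence, leftovers in find_residue(sample):
    print(index, repr(sentence), leftovers)
```

An empty report means every sentence passes under this rule; any characters it lists are candidates for an extra substitution in the cleaning function.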
