5. Data processing for the N-gram language model

1. Steps

Dataset: a passage of English text.

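(1) Tokenization: split the raw English text into words, keeping the words in their original order. Multiple sentences are treated as one continuous sequence and tokenized together.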

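(2) Frequency counting: count how often each run of n consecutive words occurs. For example, with n = 2 the text "i love nlp i enjoy it" splits into ([i love], [love nlp], [nlp i], ...), and we count the occurrences of [i love], [love nlp], and so on. (Naive Bayes counts single words only; here we count groups of n adjacent words.)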

(3) Sort the counted n-grams from most frequent to least frequent and take the top m of them as the feature vector.
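The three steps above can be sketched on the sample sentence as follows. This is a minimal illustration, not the post's own implementation: the helper name `count_ngrams` and the choice m = 3 are assumptions made for the example, and `collections.Counter` stands in for a hand-written dictionary.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Step 1: tokenize, keeping the words in their original order.
tokens = "i love nlp i enjoy it".split()

# Step 2: count the 2-grams.
counts = count_ngrams(tokens, 2)

# Step 3: sort by frequency (descending) and keep the top m (m = 3 is an assumption).
top_m = [gram for gram, _ in counts.most_common(3)]

print(counts)
print(top_m)
```

In this short sentence every 2-gram occurs exactly once, so the ranking only becomes informative on a longer text.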

The remaining steps are the same as for text classification.
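To hand the result to those later steps, one plausible reading of "feature vector" is a list of counts, one per selected n-gram. The sketch below assumes that reading; the names `ngrams` and `to_feature_vector` and the three-entry vocabulary are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_feature_vector(tokens, vocabulary, n):
    """Count how often each vocabulary n-gram occurs in the token list."""
    counts = Counter(ngrams(tokens, n))
    return [counts[gram] for gram in vocabulary]


# Hypothetical top-m list (m = 3) selected from a training text.
vocabulary = ["i love", "love nlp", "i enjoy"]

# A new document to be classified.
vector = to_feature_vector("i love nlp and i love it".split(), vocabulary, 2)
print(vector)  # [2, 1, 0]
```

N-grams that are not in the vocabulary (such as "nlp and" here) are simply ignored, which keeps every document's vector at the same length m.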

2. Code

# -*- coding: utf-8 -*-
import operator
import re
import string
import urllib.request


def cleantext(text):
    text = re.sub(r'\n+', " ", text).lower()  # replace runs of newlines with a space, then lowercase
    text = re.sub(r'\[[0-9]*\]', "", text)  # strip citation markers such as [1]
    text = re.sub(r' +', " ", text)  # collapse runs of spaces into a single space
    return text


def cleaninput(text):
    text = cleantext(text)
    words = []
    for item in text.split(' '):  # split on spaces into a list of tokens
        item = item.strip(string.punctuation)  # strip leading and trailing punctuation
        # keep words longer than one character, plus the one-letter words "a" and "i"
        if len(item) > 1 or item.lower() in ('a', 'i'):
            words.append(item)
    return words


def getngrams(text, n):
    # split the text into words, keeping each word in its original order
    words = cleaninput(text)
    output = {}  # maps each n-gram to its count
    for i in range(len(words) - n + 1):
        ngramtemp = " ".join(words[i:i + n])
        if ngramtemp not in output:  # first time this n-gram is seen
            output[ngramtemp] = 0
        output[ngramtemp] += 1
    return output


# Fetch the data; content is a passage of English text (supply the URL to fetch)
content = urllib.request.urlopen(urllib.request.Request("")).read().decode("utf-8", "ignore")
# 2-grams
ngrams = getngrams(content, 2)
sortedngrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True sorts in descending order
print(sortedngrams)
