NLTK: Processing Raw Text

import nltk
from urllib.request import urlopen

url = ""  # fill in the URL of the plain-text e-book
raw = urlopen(url).read()  # read() returns the response body as bytes
print(type(raw))
print(len(raw))
print(raw[:75])

Output

<class 'bytes'>

1201733

b'\xef\xbb\xbfthe project gutenberg ebook of crime and punishment, by fyodor dostoevsk'

In Python 3 the variable raw is of type bytes. It has to be converted from bytes to str before the tokenizer function nltk.word_tokenize() can be called on it.

c = str(raw, encoding='utf-8')  # decode the bytes into a str
print(type(c))
tokens = nltk.word_tokenize(c)  # split the text into tokens
print(type(tokens))
print(len(tokens))
print(tokens[:10])

Output

<class 'str'>

<class 'list'>

257726

['\ufeffthe', 'project', 'gutenberg', 'ebook', 'of', 'crime', 'and', 'punishment', ',', 'by']
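The first token in the output above starts with '\ufeff', which is the byte order mark (BOM) carried over from the '\xef\xbb\xbf' bytes at the start of the download. One possible way to drop it, sketched here on a made-up sample byte string rather than the real download, is to decode with the 'utf-8-sig' codec from the standard library:

```python
# Hypothetical sample bytes that begin with a UTF-8 BOM, like the download above.
sample = b'\xef\xbb\xbfThe Project Gutenberg EBook'

with_bom = sample.decode('utf-8')         # the BOM survives as '\ufeff'
without_bom = sample.decode('utf-8-sig')  # the BOM is stripped while decoding

print(repr(with_bom[:4]))     # '\ufeffThe'
print(repr(without_bom[:3]))  # 'The'
```

With 'utf-8-sig' the decoded string no longer begins with the BOM, so the first token would not carry the stray '\ufeff' prefix.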

Wrap the list of tokens in an nltk.Text object, and you can then work with the text just as in Chapter 1.

text = nltk.Text(tokens)  # the class name is Text, with a capital T
print(type(text))
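To give a feel for what a Text object offers, here is a minimal sketch. It uses a small hand-written token list (sample_tokens is a made-up name, not from the original) so that it runs without downloading the book or any tokenizer data:

```python
import nltk

# Hypothetical hand-written tokens, so no download or tokenizer model is needed.
sample_tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
sample_text = nltk.Text(sample_tokens)

print(len(sample_text))          # number of tokens: 7
print(sample_text.count('the'))  # occurrences of 'the': 2
print(sample_text[:3])           # slicing returns a plain list of tokens
```

The same calls should work on the text object built from the full token list above.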

If document.txt is in the same directory as the .py file:

f = open('document.txt')
raw = f.read()
f.close()          # release the file handle once the content is read
print(type(raw))   # <class 'str'>
print(raw)

If the file is on the desktop:

f = open(r'c:\users\administrator.lyh-20170315dbk\desktop\document.txt')
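The open() calls above rely on the platform's default encoding and leave closing the file to the caller. A slightly more defensive variant, sketched below, uses a with-statement and an explicit encoding. The temporary directory and the sample sentence are made up here purely so the snippet runs on its own, and UTF-8 is an assumption about how the file was saved:

```python
import tempfile
from pathlib import Path

# Hypothetical sample file, created only so this sketch is runnable as-is.
sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / "document.txt"
sample_path.write_text("Time flies like an arrow.\n", encoding="utf-8")

# The with-statement closes the file even if reading raises an error.
with open(sample_path, encoding="utf-8") as f:
    raw_text = f.read()

print(type(raw_text))  # <class 'str'>
print(raw_text)
```

In a real script, sample_path would simply be replaced by the path to your own document.txt.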
