Text Analysis: Accessing Files with NLTK

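# -*- coding: utf-8 -*-
from __future__ import division
import nltk, re, pprint
"""    Accessing text from the web and from disk:
1. E-books
2. Processing HTML
3. Processing search engine results
4. Reading local files
5. Extracting text from PDF, Word and other binary formats
6. Capturing user input
7. The NLP pipeline
"""
# 1. E-books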
# from urllib import urlopen
#
# url = ''
# raw = urlopen(url).read()  # download the e-book as one raw string
# # print raw
# print type(raw)
# print len(raw)
# print raw[:60]
#
# tokens = nltk.word_tokenize(raw)  # split the raw string into tokens
# print type(tokens)
# print len(tokens)
# print tokens[:10]
# text = nltk.Text(tokens)  # the class is nltk.Text, with a capital T
# print type(text)
# # print text[1020:1060]
# text.collocations()  # prints the collocations itself and returns None
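# 2. Processing HTML
# import nltk
# from urllib import urlopen
# from bs4 import BeautifulSoup
# url = ''
# html = urlopen(url).read()
# print html  # prints the whole page, tags and content included
# print html[:60]
# # nltk.clean_html() is not implemented in NLTK 3 (it raises NotImplementedError);
# # BeautifulSoup's get_text() is the suggested replacement.
# raw = BeautifulSoup(html, 'html.parser').get_text()
# print raw[10:20]
# tokens = nltk.word_tokenize(raw)
# print tokens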
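Stripping markup does not strictly require a third-party package. The following is a minimal sketch, assuming reasonably well-formed HTML, that uses the standard-library `html.parser` module. It is written for Python 3, unlike the Python 2 listing around it, and `TextExtractor` and `strip_tags` are hypothetical helper names, not NLTK functions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    # Collects text nodes and skips the contents of <script> and <style>.

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse the whitespace left behind by removed tags.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ("<html><body><h1>Title</h1><p>Hello <b>NLTK</b> world.</p>"
          "<script>var x = 1;</script></body></html>")
print(strip_tags(sample))  # Title Hello NLTK world.
```

Because the chunks are joined with spaces, a word split by inline markup (for example `wor<b>ld</b>`) would come out as two tokens; a dedicated HTML library handles such cases more carefully.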
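# 3. Reading local files
# f = open('dictionnaire.txt')
# raw = f.read()
# print raw
#
# f = open('dictionnaire.txt')
# for line in f:  # iterate over the file object; iterating over raw yields single characters
#     print line.strip()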
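# 4. Extracting text from PDF and Word files
# 5. User input
# s = raw_input("please enter some text:")
# print len(s)
"""    NLP pipeline:
1. Open a URL, read the HTML it returns, and strip the markup
2. Tokenize the resulting text and convert it to a Text object
3. Lowercase every word and extract the vocabulary (deduplicated and sorted)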
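Steps 2 and 3 of this flow can be tried without network access. The sketch below is a rough illustration in Python 3 and assumes the markup has already been removed. The regex-based `tokenize` is only a stand-in for `nltk.word_tokenize`, which needs the NLTK tokenizer data to be downloaded, and both function names are hypothetical.

```python
import re


def tokenize(text):
    # Stand-in tokenizer: keeps words (with inner apostrophes or hyphens), drops punctuation.
    return re.findall(r"\w+(?:['-]\w+)*", text)


def vocabulary(tokens):
    # Lowercase every token, deduplicate with a set, then sort.
    return sorted({t.lower() for t in tokens})


raw = "The cat sat on the mat. The Cat didn't move."
tokens = tokenize(raw)
vocab = vocabulary(tokens)

print(len(tokens))  # 10
print(vocab)        # ['cat', "didn't", 'mat', 'move', 'on', 'sat', 'the']
```

With NLTK installed, the token list could be wrapped as `nltk.Text(tokens)` before the vocabulary is built, which is the conversion step 2 refers to.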
""""""
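String operations:
s.find(t)          Index of the first occurrence of t in s (-1 if not found)
s.rfind(t)         Index of the last occurrence of t in s (-1 if not found)
s.index(t)         Like s.find(t), but raises ValueError if t is not found
s.rindex(t)        Like s.rfind(t), but raises ValueError if t is not found
s.join(text)       Join the words of text into one string, using s as the separator
s.split(t)         Split s into a list wherever t is found
s.splitlines()     Split s into a list of strings, one per line
s.lower()          Lowercased copy of s
s.upper()          Uppercased copy of s
s.title()          Titlecased copy of s (first letter of each word capitalized)
s.strip()          Copy of s without leading or trailing whitespace
s.replace(t, u)    Replace every occurrence of t in s with u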
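Several of these methods can be checked directly in an interpreter. A small Python 3 sketch with arbitrary sample strings follows; note that the built-in method for title-casing is `str.title()`.

```python
s = "natural language"

print(s.find("a"))    # 1, the first "a"
print(s.rfind("a"))   # 13, the last "a"
print(s.find("z"))    # -1, not found

try:
    s.index("z")      # same search as find(), but a miss raises ValueError
except ValueError as err:
    print("ValueError:", err)

words = s.split(" ")
print(words)                     # ['natural', 'language']
print("-".join(words))           # natural-language
print("one\ntwo".splitlines())   # ['one', 'two']
print(s.title())                 # Natural Language
print("  padded  ".strip())      # padded
print(s.replace("a", "A"))       # nAturAl lAnguAge
```

The practical difference between `find` and `index` is how a miss is reported: `find` suits a quick membership-style check, while `index` suits code where a missing substring is a genuine error.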
""""""
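Unicode:
Unicode supports over a million characters. Each character is assigned a number, called a code point.
In Python, a code point is written as \uXXXX, where XXXX is a four-digit hexadecimal number.
Text in a file is always stored in some particular encoding, so a mechanism is needed to translate it into Unicode. This process is called decoding.
To write Unicode to a file or a terminal, it must first be converted into a suitable encoding. This process is called encoding.
gb2312  --> decode --> unicode --> encode --> gb2312
latin-2 --> decode --> unicode --> encode --> latin-2
utf-8   --> decode --> unicode --> encode --> utf-8
"""
# Read a file that is stored in a known encoding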
# path = nltk.data.find('history of france.txt')
# import codecs
# f = codecs.open(path, encoding='utf8')
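The decode and encode round trips described above can be observed on a short string. This is a minimal Python 3 sketch, where every `str` is already Unicode; the sample text is arbitrary and was chosen so that both characters exist in GB2312.

```python
text = "中文"  # in Python 3 a str is a sequence of Unicode code points

print(hex(ord(text[0])))     # 0x4e2d, the code point of the first character
print("\u4e2d" == text[0])   # True: the \uXXXX escape names the same character

utf8_bytes = text.encode("utf-8")   # unicode --> encode --> utf-8
gb_bytes = text.encode("gb2312")    # unicode --> encode --> gb2312
print(len(utf8_bytes), len(gb_bytes))  # 6 4

# Decoding with the codec that produced the bytes restores the text.
print(utf8_bytes.decode("utf-8") == text)   # True
print(gb_bytes.decode("gb2312") == text)    # True

# Decoding with the wrong codec does not; errors="replace" avoids an exception.
garbled = utf8_bytes.decode("gb2312", errors="replace")
print(garbled == text)                      # False
```

The same text takes a different number of bytes in each encoding, which is why the encoding of a file has to be known, or at least guessed correctly, before the file is read.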
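s = u'中華人民共和國'
# u = s.decode('utf8')  # not needed: s is already a unicode object
print s.encode('utf8')
print '-------------------------'
f = open('reddream.txt')
raw = f.read()  # a byte string in Python 2; no decoding happens here
f.close()
print raw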
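A plain `open()` call leaves the choice of encoding implicit. As a hedged alternative, `io.open` accepts an `encoding` argument in both Python 2.6+ and Python 3, so the text is decoded while it is read. The sketch below writes its own small sample file; `demo.txt` is a hypothetical name standing in for a real corpus file.

```python
import io
import os
import tempfile

# Hypothetical sample file, created here so that the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"第一回\n第二回\n")

# Decode while reading. Iterating over the file object yields lines;
# iterating over the string returned by read() would yield single characters.
with io.open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f]

print(len(lines))  # 2
```

The `with` statement also closes the file when the block ends, including when an exception is raised inside it.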
