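Counting English words and Chinese characters in text with C++

1. Count the frequency of each Chinese character in a text, as groundwork for later text classification. Counting Chinese characters requires checking whether each character read is in fact a Chinese character. The source code is as follows: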

[c++ code]


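/**
 * @author: 鄭海波
 * Reference: 實驗室小熊
 * Note: abridged and modified
 */
#pragma warning(disable:4786)
#include <cstddef>
#include <ctime>
#include <fstream>
#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

void topk(const int& k)
{
    clock_t start = clock();
    // Assumption: the input file name did not survive in this listing;
    // "test.txt" is a placeholder.
    ifstream infile("test.txt", ios::binary);
    if (!infile)
    {
        cout << "cannot open file" << endl;
        return;
    }
    map<string, int> wordcount;
    string s = "";
    char c;
    // Assumption: the text is GB2312-encoded, so every non-ASCII character is
    // two bytes and a Chinese character has a lead byte in 0xB0..0xF7.
    while (infile.get(c))
    {
        unsigned char lead = static_cast<unsigned char>(c);
        if (lead < 0x80)
            continue;                 // ASCII byte
        s += c;
        if (!infile.get(c))
            break;                    // truncated character at end of file
        s += c;
        if (lead >= 0xB0 && lead <= 0xF7)
            wordcount[s]++;
        // Non-Chinese characters (such as full-width punctuation) are not counted.
        s = "";
    }
    // Assumption: the wording of the output labels is illustrative.
    cout << "Distinct characters: " << wordcount.size() << endl;
    // The priority queue is a min-heap (it uses ">"), so the entry with the smallest count is on top.
    priority_queue<pair<int, string>, vector<pair<int, string> >, greater<pair<int, string> > > queuek;
    for (map<string, int>::iterator iter = wordcount.begin(); iter != wordcount.end(); iter++)
    {
        queuek.push(make_pair(iter->second, iter->first));
        if (queuek.size() > static_cast<size_t>(k))
            queuek.pop();             // drop the current minimum, keeping the k largest
    }
    pair<int, string> tmp;
    // Move the entries into a max-heap so that larger counts come out first.
    priority_queue<pair<int, string>, vector<pair<int, string> >, less<pair<int, string> > > queuekless;
    while (!queuek.empty())
    {
        tmp = queuek.top();
        queuek.pop();
        queuekless.push(tmp);
    }
    while (!queuekless.empty())
    {
        tmp = queuekless.top();
        queuekless.pop();
        cout << tmp.second << "\t" << tmp.first << endl;
    }
    cout << "< Elapsed time: " << double(clock() - start) / CLOCKS_PER_SEC << " s>" << endl;
}

int main()
{
    int k = 10;   // Assumption: the original way of choosing k was lost; 10 is illustrative.
    topk(k);
    return 0;
}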

[Figure 1]
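The check for whether a character is a Chinese character can be isolated and tried on an in-memory string. The following is a minimal sketch, assuming GB2312-encoded bytes (the encoding is not stated in the post); `is_gb2312_hanzi_lead` and `count_hanzi` are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: true if `lead` starts a two-byte GB2312 Chinese character.
// In GB2312 the Chinese characters occupy rows 16-87, i.e. lead bytes 0xB0..0xF7.
bool is_gb2312_hanzi_lead(unsigned char lead) {
    return lead >= 0xB0 && lead <= 0xF7;
}

// Hypothetical helper: count each Chinese character in a GB2312-encoded byte string.
std::map<std::string, int> count_hanzi(const std::string& text) {
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        if (lead < 0x80) continue;          // single-byte ASCII
        if (i + 1 >= text.size()) break;    // truncated two-byte sequence
        if (is_gb2312_hanzi_lead(lead)) {
            ++counts[text.substr(i, 2)];    // key is the two-byte character
        }
        ++i;                                // skip the trail byte
    }
    return counts;
}
```

Treating every non-ASCII byte as the start of a two-byte sequence keeps the scan aligned when the text contains full-width punctuation, whose trail bytes could otherwise be mistaken for lead bytes. Text in another encoding, such as UTF-8, would need a different check.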

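2. Count how often each English word occurs. This is easier than counting Chinese characters: words are separated by spaces, so each word can be read directly into a string.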

[c++ code]


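/**
 * @author: 鄭海波
 * Reference: 實驗室小熊
 * Note: abridged and modified
 */
#pragma warning(disable:4786)
#include <cstddef>
#include <ctime>
#include <fstream>
#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

void topk(const int& k)
{
    clock_t start = clock();
    // Assumption: the input file name did not survive in this listing;
    // "test.txt" is a placeholder.
    ifstream infile("test.txt");
    if (!infile)
    {
        cout << "cannot open file" << endl;
        return;
    }
    map<string, int> wordcount;
    string s;
    while (infile >> s)               // operator>> reads one whitespace-separated word
        wordcount[s]++;
    // Assumption: the wording of the output labels is illustrative.
    cout << "Distinct words: " << wordcount.size() << endl;
    // The priority queue is a min-heap (it uses ">"), so the entry with the smallest count is on top.
    priority_queue<pair<int, string>, vector<pair<int, string> >, greater<pair<int, string> > > queuek;
    for (map<string, int>::iterator iter = wordcount.begin(); iter != wordcount.end(); iter++)
    {
        queuek.push(make_pair(iter->second, iter->first));
        if (queuek.size() > static_cast<size_t>(k))
            queuek.pop();             // drop the current minimum, keeping the k largest
    }
    pair<int, string> tmp;
    // Move the entries into a max-heap so that larger counts come out first.
    priority_queue<pair<int, string>, vector<pair<int, string> >, less<pair<int, string> > > queuekless;
    while (!queuek.empty())
    {
        tmp = queuek.top();
        queuek.pop();
        queuekless.push(tmp);
    }
    while (!queuekless.empty())
    {
        tmp = queuekless.top();
        queuekless.pop();
        cout << tmp.second << "\t" << tmp.first << endl;
    }
    cout << "< Elapsed time: " << double(clock() - start) / CLOCKS_PER_SEC << " >" << endl;
}

int main()
{
    int k = 10;   // Assumption: the original way of choosing k was lost; 10 is illustrative.
    topk(k);
    return 0;
}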

[Figure 2]

Reference: 實驗室小熊
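The two steps used in both listings, counting tokens in a map and then keeping the k most frequent with a bounded min-heap, can be sketched on an in-memory string. This is only an illustrative sketch: `Entry`, `count_words` and `top_k` are hypothetical names, and the input comes from a string instead of a file.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<int, std::string> Entry;  // (count, word)

// Hypothetical helper: count whitespace-separated words in `text`.
std::map<std::string, int> count_words(const std::string& text) {
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {   // operator>> skips whitespace and reads one token
        ++counts[word];
    }
    return counts;
}

// Hypothetical helper: the k most frequent entries, most frequent first.
std::vector<Entry> top_k(const std::map<std::string, int>& counts, std::size_t k) {
    // Min-heap bounded to k entries: the smallest count is always on top,
    // so popping whenever the heap grows past k discards the current minimum.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it) {
        heap.push(Entry(it->second, it->first));
        if (heap.size() > k) heap.pop();
    }
    // The heap yields entries in ascending order; fill the result from the back.
    std::vector<Entry> result(heap.size());
    for (std::size_t i = result.size(); i > 0; --i) {
        result[i - 1] = heap.top();
        heap.pop();
    }
    return result;
}
```

Because the heap never holds more than k + 1 entries, selecting the top k of n distinct words costs O(n log k) instead of the O(n log n) of a full sort. Filling the result vector from the back plays the same role as the second, `less`-ordered queue in the listing.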
