First-Order HMM Part-of-Speech Tagging

The corpus at hand is still msr_training.utf8 and msr_test.utf8, which come from icwb2-data.rar of the SIGHAN Bakeoff 2005.

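1. rmspace.cpp: The training file from the research institute (MSR) is already segmented, but we do not need that segmentation. We will re-segment the text and tag parts of speech with the segmenter from the Institute of Computing Technology (ICTCLAS), so the first step is to remove the spaces inside each line of the training file.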
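The space-removal step can be sketched as below. This is a minimal illustration, not the original rmspace.cpp: the function name `removeSpaces` is hypothetical, and it assumes the file is UTF-8 and that the separators are ASCII spaces, tabs, or the full-width space U+3000.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drop ASCII spaces, tabs and the full-width space
// U+3000 (UTF-8 bytes E3 80 80) from one UTF-8 encoded line.
std::string removeSpaces(const std::string& line) {
    std::string out;
    out.reserve(line.size());
    for (std::size_t i = 0; i < line.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(line[i]);
        if (c == ' ' || c == '\t') continue;
        // 0xE3 can only be a lead byte in UTF-8, so this match is exact.
        if (c == 0xE3 && i + 2 < line.size() &&
            static_cast<unsigned char>(line[i + 1]) == 0x80 &&
            static_cast<unsigned char>(line[i + 2]) == 0x80) {
            i += 2;  // skip the two continuation bytes as well
            continue;
        }
        out.push_back(line[i]);
    }
    return out;
}
```

Applying the function to every line read with `std::getline` and writing the result back out would give the unsegmented text that the segmenter expects.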

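// This fragment calls the ICTCLAS file API with POS tagging enabled,
// i.e. the segmentation program referred to later as wordseg.cpp.
#include <string>
#include "ICTCLAS50.h"   // remaining header names were lost
using namespace std;

int main(int argc, char *argv[])
{
    string filename = argv[1];
    string outfile = filename + ".ws";
    string initpath = "/home/orisun/master/ictclas50_linux_rhas_32_c/api";
    if (!ICTCLAS_Init(initpath.c_str()))
        return 1;   // ICTCLAS failed to initialize
    // last argument 1: output POS tags together with the segmentation
    ICTCLAS_FileProcess(filename.c_str(), outfile.c_str(), CODE_TYPE_UTF8, 1);
    ICTCLAS_Exit();
    return 0;
}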

4. Since our task is POS tagging, the test file has to be segmented first. We again use wordseg.cpp.

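5. rmpos.cpp: The ICT segmenter tags parts of speech while it segments (editing the configuration file configure.xml has no effect), so the POS tags now have to be stripped from the test text.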

// (four #include lines; header names lost)
using namespace std;
int main(int argc, char *argv[])
    ...
    ofs << ...

6. Run rmpos.cpp on the training text (that is, the output of step 3) as well.
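Stripping the tags can be sketched as follows. This is an illustration under assumptions rather than the original rmpos.cpp: it assumes each token in the segmenter output has the form `word/tag` with tokens separated by whitespace, and the function name `stripPos` is hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: turn "word/tag word/tag ..." into "word word ...".
std::string stripPos(const std::string& taggedLine) {
    std::istringstream in(taggedLine);
    std::string token, out;
    while (in >> token) {
        // Use the LAST slash so that a word which is itself "/" survives.
        std::string::size_type slash = token.rfind('/');
        std::string word = (slash == std::string::npos || slash == 0)
                               ? token                    // no tag found: keep as is
                               : token.substr(0, slash);  // drop "/tag"
        if (!out.empty()) out += ' ';
        out += word;
    }
    return out;
}
```

Searching from the right matters for punctuation: a token such as `//w` (the word `/` tagged `w`) yields `/` instead of an empty string.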
7. createdict.cpp: Steps 5 and 6 produce every word and punctuation mark that occurs in the training and test sets; now store all of them in a gdbm database.

#include <gdbm.h>   // remaining header names were lost
using namespace std;
int main(int argc, char *argv[])
{
    ...
    for (key = gdbm_firstkey(dbm_ptr); key.dptr; key = gdbm_nextkey(dbm_ptr, key))
        ...
    gdbm_close(dbm_ptr);
    return 0;
}

9. query.c and lookup.c (optional helpers): The former prints every record in the database; the latter looks up the value stored in gdbm for a key entered by the user.

#include <stdio.h>
#include <gdbm.h>   /* remaining header names were lost */
#define db_file_block "dict_db"
int main(int argc, char *argv[])
{
    ...
    printf("\n");
    gdbm_close(dbm_ptr);
    return 0;
}

#include <stdio.h>
#include <gdbm.h>   /* remaining header names were lost */
int main(int argc, char *argv[])
{
    ...
    else
        ...
    gdbm_close(dbm_ptr);
    return 0;
}

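10. amatrix.cpp: Count over the training text (the output of step 3, of course) to build the state transition matrix and the initial state probability matrix, which are written to a.mat and pi.mat respectively.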

The header file header.h mainly contains the ICTCLAS POS tag set and the Good-Turing smoothing algorithm.

#ifndef _header_h
#define _header_h
// (three #include lines; header names lost)
using namespace std;
const int pos_num = 97;     // the ICT Chinese POS tag set has pos_num elements once punctuation is removed
/* pos_num parts of speech, i.e. pos_num states */
string posarr[pos_num] = ...;
void goodturing(const int count[], double prob[], int len)
{
    ...
    else
        ...
    map<int, list<int> >::const_iterator iter = count_map.begin();
    while (iter != count_map.end())
        ...
        else
            ...
        else
            ...
            list<int> l = (--iter)->second;
            list<int>::const_iterator itr1 = l.begin();
            while (itr1 != l.end())
                ...
        ++iter;
    }
    // normalize the probabilities
    double sum = 0;
    for (int i = 0; i < ...
    ...
// (ten #include lines; header names lost)
#include "header.h"
int a[pos_num][pos_num];    // number of transitions between states
int pi[pos_num];            // number of occurrences of each state
inline int indexof(string search);
...                         // confusion matrix (also called the emission matrix)
inline int indexof(string search)
    ...
    }
    else
        ...
    // read b
    for (int i = 0; i < ...
            ... >> word;
            b[i][j] = atof(word.c_str());
        }
    }
    ifs1.close();
    ifs2.close();
    ifs3.close();
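}

/* POS tagging with the Viterbi algorithm */
void viterbi(vector<string> terms, string &result)
{
    ...
            q[i][j] = max * b[j][colindex];
            path[i][j] = maxindex;
        }
    }
    // find the maximum in the last row of the q matrix
    double max = -1.0;
    int maxindex = -1;
    for (int i = 0; i < ...
        ... > max)
        ...
    // starting from maxindex, follow the path matrix to recover the most likely state sequence
    stack<int> st;
    st.push(maxindex);
    for (int i = row - 1; i > 0; --i)
        ...
    // free the two-dimensional arrays
    for (int i = 0; i < ...
    ...
    vector<string> term_vec;
    string result;
    while (strstm >> term)
        ...
    viterbi(term_vec, result);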
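    ofs << ...

Let's look at the results. On the left is the ICTCLAS POS-tagging output, taken as the gold standard; on the right is the output of my first-order HMM tagger.
With simple add-one smoothing:
As you can see, tagging accuracy is still low, and the tag "mq" accounts for most of the errors.
With Good-Turing smoothing, by and large no errors are visible any more: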
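For reference, the Viterbi decoding at the heart of the tagger can be sketched in self-contained form as below. This is an illustration under assumptions, not the original listing: words are assumed to have been mapped to column indices of the emission matrix beforehand, the names `Matrix` and `viterbi` and the parameter layout are chosen for this sketch, and probabilities are multiplied directly, so very long sentences could underflow (summing log probabilities would avoid that).

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// obs: word indices; pi: initial probabilities; A[k][j]: transition k -> j;
// B[j][w]: probability that state j emits word w.
// Returns the most likely state index for every position of obs.
std::vector<int> viterbi(const std::vector<int>& obs,
                         const std::vector<double>& pi,
                         const Matrix& A, const Matrix& B) {
    const std::size_t T = obs.size(), N = pi.size();
    if (T == 0 || N == 0) return {};
    Matrix q(T, std::vector<double>(N, 0.0));               // best path scores
    std::vector<std::vector<int>> path(T, std::vector<int>(N, 0));  // back-pointers

    for (std::size_t j = 0; j < N; ++j) q[0][j] = pi[j] * B[j][obs[0]];

    for (std::size_t i = 1; i < T; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            double best = -1.0;
            int bestIdx = 0;
            for (std::size_t k = 0; k < N; ++k) {
                double v = q[i - 1][k] * A[k][j];
                if (v > best) { best = v; bestIdx = static_cast<int>(k); }
            }
            q[i][j] = best * B[j][obs[i]];
            path[i][j] = bestIdx;
        }
    }
    // Pick the best final state, then walk the back-pointers.
    int last = 0;
    for (std::size_t j = 1; j < N; ++j)
        if (q[T - 1][j] > q[T - 1][last]) last = static_cast<int>(j);
    std::vector<int> states(T);
    states[T - 1] = last;
    for (std::size_t i = T - 1; i > 0; --i)
        states[i - 1] = path[i][states[i]];
    return states;
}
```

With two states, pi = {0.6, 0.4}, A = {{0.7, 0.3}, {0.4, 0.6}}, B = {{0.9, 0.1}, {0.2, 0.8}} and the observation sequence {0, 1, 1}, the decoded state sequence is {0, 1, 1}.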