This paper describes the performance of conditional random field (CRF) based systems for Named Entity Recognition (NER) in Indian languages, developed as part of the ICON 2013 shared task. We use a common set of language-independent features for all languages; only for English is a language-specific feature, capitalization, added. We then explore the use of gazetteers for Bengali, Hindi and English. The gazetteers are built from Wikipedia and other sources. Test results show that the system achieves its highest F-measure, 88%, for English and its lowest, 69%, for both Tamil and Telugu. No gazetteer was used for these two lowest-scoring languages. For Bengali and Hindi, the system achieves F-measures of 87% and 79%, respectively.
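The feature design summarized above can be pictured with a minimal sketch. The abstract does not list the actual language-independent features, so the ones below (lowercased word, short prefix and suffix, digit flag, neighbouring words) are illustrative assumptions, as are the function names and the toy gazetteer. Only the capitalization flag (English only) and the gazetteer lookup correspond to features the abstract names.

```python
from typing import Dict, List, Set


def token_features(tokens: List[str], i: int, gazetteer: Set[str],
                   use_capitalization: bool) -> Dict[str, object]:
    """Build a feature dict for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    feats: Dict[str, object] = {
        # Assumed language-independent features (not taken from the paper).
        "word": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # Gazetteer lookup: the abstract uses gazetteers for Bengali, Hindi
        # and English; pass an empty set for languages without one.
        "in_gazetteer": word.lower() in gazetteer,
    }
    if use_capitalization:
        # Language-specific feature, enabled for English only.
        feats["is_capitalized"] = word[:1].isupper()
    return feats


def sentence_features(tokens: List[str], gazetteer: Set[str],
                      use_capitalization: bool) -> List[Dict[str, object]]:
    """Feature dicts for every token in a sentence."""
    return [token_features(tokens, i, gazetteer, use_capitalization)
            for i in range(len(tokens))]


# Toy example: an English sentence with a one-entry gazetteer.
tokens = ["Sachin", "lives", "in", "Mumbai"]
gazetteer = {"mumbai"}
english_feats = sentence_features(tokens, gazetteer, use_capitalization=True)
other_feats = sentence_features(tokens, set(), use_capitalization=False)
```

Per-token dictionaries of this shape are a common input format for CRF toolkits; which toolkit the authors used is not stated in the abstract.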