This paper introduces a new annotated corpus built on an existing corpus of informal text, the NUS SMS Corpus (Chen and Kan, 2013). The new corpus contains 76,490 noun phrases from 26,500 SMS messages, annotated by university students. We then explore several graphical models for the task of noun phrase chunking, including a novel variant of the semi-Markov conditional random field (semi-CRF). Empirical evaluation on the new dataset shows that the new variant achieves accuracy similar to that of the conventional semi-CRF while requiring significantly less running time.