Most named entity recognizers are sequence taggers trained on fully annotated corpora, i.e. corpora in which the class of every word of every entity is known. Partially annotated corpora, in which some but not all entities of some types are annotated, are too noisy for training sequence taggers: the same entity may be annotated with its true type in one place but left unannotated in another, which misleads the tagger. We therefore compare three training strategies for partially annotated datasets, together with an approach for deriving new datasets for new entity classes from Wikipedia without time-consuming manual annotation. To verify that our data acquisition and training approaches are plausible, we manually annotated test datasets for two new classes, namely food and drugs.
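The noise problem described above can be illustrated with a minimal sketch. It assumes a BIO tagging scheme and a naive conversion that labels every token outside an annotated span as `O`; the function names, the `DRUG` label, and the two-sentence toy corpus are hypothetical and chosen only for illustration. They are not the datasets or the training strategies compared in this work.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label), hypothetical format


def naive_bio_tags(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert span annotations to BIO tags.

    Every token outside an annotated span is tagged 'O', which is only
    correct if the corpus is fully annotated.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def conflicting_labels(
    corpus: List[Tuple[List[str], List[Span]]]
) -> Dict[str, List[str]]:
    """Return lower-cased tokens that received more than one class."""
    seen: Dict[str, set] = {}
    for tokens, spans in corpus:
        for token, tag in zip(tokens, naive_bio_tags(tokens, spans)):
            # Strip the B-/I- prefix so only the class is compared.
            label = "O" if tag == "O" else tag[2:]
            seen.setdefault(token.lower(), set()).add(label)
    return {tok: sorted(labels) for tok, labels in seen.items() if len(labels) > 1}


# Toy partially annotated corpus: the second mention is left unannotated.
corpus = [
    (["She", "took", "aspirin", "daily"], [(2, 3, "DRUG")]),
    (["Aspirin", "can", "thin", "blood"], []),
]

print(conflicting_labels(corpus))  # {'aspirin': ['DRUG', 'O']}
```

In this toy corpus the same surface form is presented to the tagger once as a drug and once as a non-entity, which is the kind of contradictory supervision that makes partially annotated corpora noisy.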