The task of event extraction has long been studied in a supervised learning paradigm, which is bound by the number and quality of the available training instances. Existing training data must be generated manually, combining expert domain knowledge with extensive human annotation effort. Because annotating text is so labour-intensive, the resulting datasets are usually small, which severely limits the quality of the learned model and makes it hard to generalize. Our work develops an automatic approach for generating training data for event extraction. Our approach scales the number of event extraction training instances from thousands to hundreds of thousands, at a much lower cost than manual annotation. We achieve this by employing distant supervision to automatically create event annotations from unlabelled text using existing structured knowledge bases or tables. We then develop a neural network model with post-inference to transfer the knowledge extracted from structured knowledge bases and automatically annotate typed events, together with their corresponding arguments, in text. We evaluate our approach by using knowledge extracted from Freebase to label text from Wikipedia articles. Experimental results show that our approach generates a large number of high-quality training instances. We show that this large volume of training data not only leads to a better event extractor, but also allows us to detect events of multiple types.
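To make the distant-supervision step concrete, the sketch below shows one naive way such annotations could be produced: sentences that mention enough arguments of a structured event record are labelled as positive instances of that event type. This is a minimal illustration, not the paper's actual pipeline; the names `EventRecord`, `distantly_label`, the `min_args` threshold, and the substring-matching heuristic are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass

@dataclass
class EventRecord:
    """A structured event from a knowledge base (e.g., one Freebase fact)."""
    event_type: str   # e.g., "people.marriage"
    arguments: dict   # role -> entity string, e.g., {"spouse1": "Ada Lovelace"}

def distantly_label(sentences, records, min_args=2):
    """Label a sentence as a positive instance of an event type when it
    mentions at least `min_args` arguments of some KB event record.
    Deliberately naive substring matching, for illustration only."""
    instances = []
    for sent in sentences:
        for rec in records:
            matched = {role: ent for role, ent in rec.arguments.items()
                       if ent in sent}
            if len(matched) >= min_args:
                instances.append({"sentence": sent,
                                  "event_type": rec.event_type,
                                  "arguments": matched})
    return instances

if __name__ == "__main__":
    records = [EventRecord("people.marriage",
                           {"spouse1": "Ada Lovelace",
                            "spouse2": "William King"})]
    sentences = ["Ada Lovelace married William King in 1835.",
                 "Ada Lovelace wrote notes on the Analytical Engine."]
    for inst in distantly_label(sentences, records):
        print(inst)
```

In practice, a real pipeline would replace the substring matching with entity linking and add the neural model with post-inference described above to filter out spurious alignments.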