Event data, or structured records of ``who did what to whom'' that are automatically extracted from text, is an important resource for scholars of international politics. The high cost of developing new event datasets, especially with automated systems that rely on hand-built dictionaries, means that most researchers draw on large, pre-existing datasets such as ICEWS rather than developing tailor-made event datasets optimized for their specific research question. This paper describes a ``bag of tricks'' for efficient, custom event data production that draws on recent advances in natural language processing (NLP). The paper introduces techniques for training an event category classifier with active learning; for identifying actors and the recipients of actions in text using large language models, standard machine learning classifiers, and pretrained ``question-answering'' models from NLP; and for resolving mentions of actors to their Wikipedia articles so that the actors can be categorized. We describe how these techniques produced the new POLECAT global event dataset, which is intended to replace ICEWS, and give examples of how scholars can quickly produce smaller, custom event datasets. We publish example code and models to implement our new techniques.
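As a rough illustration of the active learning step mentioned above, the following minimal sketch shows one common selection strategy, uncertainty sampling by entropy. It is not the paper's published code: the function names, the document identifiers, the three event categories, and the predicted probabilities are all hypothetical, and the classifier that would produce those probabilities is assumed rather than shown.

```python
import math
from typing import Dict, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(
    predictions: Dict[str, List[float]], batch_size: int
) -> List[Tuple[str, float]]:
    """Return the `batch_size` documents whose predictions are most uncertain."""
    scored = [(doc_id, entropy(probs)) for doc_id, probs in predictions.items()]
    # Highest entropy first: these are the examples the model is least sure about.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:batch_size]


# Hypothetical classifier output over three made-up event categories
# (say PROTEST, ASSAULT, OTHER) for four unlabeled sentences.
pool_predictions = {
    "doc_a": [0.98, 0.01, 0.01],
    "doc_b": [0.40, 0.35, 0.25],
    "doc_c": [0.70, 0.20, 0.10],
    "doc_d": [0.34, 0.33, 0.33],
}

to_label = select_for_labeling(pool_predictions, batch_size=2)
print([doc_id for doc_id, _ in to_label])
```

In a full loop, the selected documents would be hand-labeled, added to the training set, and the classifier retrained before the pool is scored again. Here the two near-uniform predictions (`doc_d`, then `doc_b`) are chosen, since labeling them is expected to be more informative than labeling a confident prediction such as `doc_a`.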