Text data are an important source of detailed information about social and political events. Automated systems parse large volumes of text to infer or extract structured information describing actors, actions, dates, times, and locations. One sub-task of this extraction is geocoding: predicting the geographic coordinates associated with the events or locations described in a given text. We present an end-to-end probabilistic model for geocoding text data. We also collect a novel data set for evaluating the performance of geocoding systems. We compare the model-based solution, called ELECTRo-map, to the current state-of-the-art open-source system for geocoding texts for event data. Finally, we discuss the benefits of end-to-end model-based geocoding, including principled uncertainty estimation and the ability of these models to leverage contextual information.