TopoBERT: Plug and Play Toponym Recognition Module Harnessing Fine-tuned BERT

Extracting precise geographical information from textual content is crucial in a wide range of applications. For example, during hazardous events, a robust and unbiased toponym extraction framework can tie the location concerned to the topic discussed in news media posts and pinpoint humanitarian help requests or damage reports from social media. Early studies have leveraged rule-based, gazetteer-based, deep learning, and hybrid approaches to address this problem. However, the performance of existing tools is insufficient to support operations such as emergency rescue, which rely on fine-grained, accurate geographic information. Emerging pretrained language models can better capture the underlying characteristics of text, including place names, offering a promising pathway to improve toponym recognition for practical applications. In this paper, TopoBERT, a toponym recognition module based on a one-dimensional Convolutional Neural Network (CNN1D) and Bidirectional Encoder Representations from Transformers (BERT), is proposed and fine-tuned. Three datasets (CoNLL2003-Train, Wikipedia3000, WNUT2017) are used to tune the hyperparameters, discover the best training strategy, and train the model. Two further datasets (CoNLL2003-Test and Harvey2017) are used to evaluate performance. Three distinct classifiers (linear, multi-layer perceptron, and CNN1D) are benchmarked to determine the optimal model architecture. TopoBERT achieves state-of-the-art performance (F1-score = 0.865) compared with five baseline models and can be applied to diverse toponym recognition tasks without additional training.
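To make the reported metric concrete, the following is a minimal sketch of how an F1-score for toponym recognition could be computed from tagged token sequences. The abstract does not say whether the score is token-level or span-level, so exact-match span-level scoring is an assumption here, as are the `B-LOC`/`I-LOC` tag names, the helper names `location_spans` and `span_f1`, and the example sentence. It is not the authors' evaluation code.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def location_spans(tags: List[str]) -> Set[Span]:
    """Collect spans of B-LOC/I-LOC runs from a BIO tag sequence.

    An I-LOC with no preceding B-LOC is ignored (assumed convention).
    """
    spans: Set[Span] = set()
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B-LOC":
            if start is not None:
                spans.add((start, i))  # close the previous toponym
            start = i
        elif tag == "I-LOC" and start is not None:
            continue  # still inside the current toponym
        else:
            if start is not None:
                spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def span_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching toponym spans."""
    gold_spans = location_spans(gold)
    pred_spans = location_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "Flooding in Houston near Buffalo Bayou tonight"
GOLD = ["O", "O", "B-LOC", "O", "B-LOC", "I-LOC", "O"]
PRED = ["O", "O", "B-LOC", "O", "B-LOC", "O", "O"]  # "Bayou" is missed

print(span_f1(GOLD, PRED))
```

In this example the prediction recovers "Houston" but truncates "Buffalo Bayou", so one of two predicted spans matches one of two gold spans. Exact-match scoring penalizes partially recovered place names, which is relevant when fine-grained locations matter.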
