Geographic features are commonly used to improve the performance of pretrained language models (PLMs) on NLP tasks where they are intuitively beneficial (e.g., geolocation prediction, dialect feature prediction). Existing methods, however, leverage geographic information in task-specific fine-tuning and fail to integrate it into the geo-linguistic knowledge encoded by PLMs, which would make it transferable across different tasks. In this paper, we introduce an approach to task-agnostic geoadaptation of PLMs that forces them to learn associations between linguistic phenomena and geographic locations. Geoadaptation is an intermediate training step that couples language modeling and geolocation prediction in a multi-task learning setup. In our main set of experiments, we geoadapt BERTić, a PLM for Bosnian-Croatian-Montenegrin-Serbian (BCMS), using a corpus of geotagged BCMS tweets. Evaluation on three tasks, namely fine-tuned as well as zero-shot geolocation prediction and zero-shot prediction of dialect features, shows that geoadaptation is very effective: e.g., we obtain state-of-the-art performance in supervised geolocation prediction and report massive gains over geographically uninformed PLMs on zero-shot geolocation prediction. Moreover, in follow-up experiments we successfully geoadapt two other PLMs, specifically ScandiBERT on Norwegian, Swedish, and Danish tweets and GermanBERT on Jodel posts in German from Austria, Germany, and Switzerland, proving that the benefits of geoadaptation are not limited to a particular language area and PLM.
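To make the multi-task coupling concrete, the following is a minimal sketch of a joint objective that adds a geolocation term to a masked language modeling term. It is an illustration under stated assumptions, not the method as implemented in the paper: the abstract does not specify the form of the geolocation loss or how the two losses are balanced, so the mean absolute error on (longitude, latitude) and the fixed `geo_weight` factor are assumptions, and all function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mlm_loss(token_logits, token_targets):
    """Mean cross-entropy over masked positions only.

    A target of None marks a position that was not masked and is skipped.
    """
    losses = [
        cross_entropy(logits, target)
        for logits, target in zip(token_logits, token_targets)
        if target is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0


def geo_loss(pred_lon_lat, true_lon_lat):
    """Mean absolute error between predicted and true (lon, lat).

    Assumption: the abstract does not state which geolocation loss is used.
    """
    diffs = [abs(p - t) for p, t in zip(pred_lon_lat, true_lon_lat)]
    return sum(diffs) / len(diffs)


def geoadaptation_loss(token_logits, token_targets,
                       pred_lon_lat, true_lon_lat, geo_weight=1.0):
    """Joint objective: language modeling plus weighted geolocation loss.

    Assumption: a fixed scalar weight; other balancing schemes are possible.
    """
    return (mlm_loss(token_logits, token_targets)
            + geo_weight * geo_loss(pred_lon_lat, true_lon_lat))


# Toy example: one masked token over a two-word vocabulary, one geotag.
loss = geoadaptation_loss(
    token_logits=[[1.0, 2.0], [0.0, 0.0]],
    token_targets=[None, 0],        # only the second position is masked
    pred_lon_lat=(16.0, 45.0),
    true_lon_lat=(15.0, 46.0),
    geo_weight=0.5,
)
print(loss)
```

In a real setup both terms would be computed from the same encoder's outputs, so that gradients from the geolocation term shape the representations used for language modeling.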