Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make text less readable, especially in printed reference books, where they are used extensively. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad hoc solutions, especially at identifying unseen abbreviations. We also propose a method for expanding the identified abbreviations in context and present its results.