International library standards require cataloguers to tediously enter Romanized versions of their catalogue records for the benefit of library users without the relevant language expertise. In this paper, we present the first reported results on the task of automatic Romanization of undiacritized Arabic bibliographic entries. This complex task requires modeling Arabic phonology, morphology, and even semantics. We collected a 2.5M-word corpus of parallel Arabic and Romanized bibliographic entries and benchmarked a number of models that vary in complexity and resource dependence. Our best system reaches 89.3% exact word Romanization accuracy on a blind test set. We make our data and code publicly available.