Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

Bilingual word lexicons are crucial tools for multilingual natural language understanding and machine translation, as they map words in one language to their translation equivalents in another. Numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, using a typical pipeline of two unsupervised steps: bitext mining and word alignment, both of which rely on pre-trained large language models (LLMs). In this paper, we present an analysis of the BLI pipeline for German and two of its dialects, Bavarian and Alemannic. This setup poses several unique challenges, including the scarcity of resources, the relatedness of the languages, and the lack of a standardized orthography for the dialects. To evaluate the BLI outputs, we analyze them with respect to word frequency and pairwise edit distance. Additionally, we release two evaluation datasets comprising 1,500 bilingual sentence pairs and 1,000 bilingual word pairs, manually judged for semantic similarity for each of the two language pairs, Bavarian-German and Alemannic-German.
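To make the edit-distance part of the evaluation concrete, the following is a minimal sketch of how pairwise edit distance between a dialect word and its German counterpart could be computed. The abstract does not say which edit distance or normalization is used, so plain Levenshtein distance and length normalization are assumptions here, and the function names and example word pairs are purely illustrative rather than taken from the released datasets.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Illustrative dialect/German pairs, not drawn from the paper's data.
pairs = [("net", "nicht"), ("i", "ich"), ("Haus", "Haus")]
for dialect_word, german_word in pairs:
    print(dialect_word, german_word,
          levenshtein(dialect_word, german_word),
          round(normalized_distance(dialect_word, german_word), 2))
```

Because dialect and standard forms are often close cognates, a low distance is expected for many correct pairs, which is one reason to examine BLI outputs along this dimension.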
