This paper proposes an approach to cross-language sentence selection in a low-resource setting. It uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model. Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data. Moreover, when a rationale training secondary objective is applied to encourage the model to match word alignment hints from a phrase-based statistical machine translation model, consistent improvements are seen across three language pairs (English-Somali, English-Swahili and English-Tagalog) over a variety of state-of-the-art baselines.
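The negative sampling step mentioned above can be illustrated with a toy sketch: positives pair a query with its aligned sentence from the noisy parallel data, and negatives pair it with sentences drawn from other pairs. This is only an illustration under assumed data structures; the function name and example pairs are invented, not from the paper.

```python
import random

def build_training_examples(parallel_pairs, num_negatives=2, seed=0):
    """Hypothetical sketch of negative sampling over (query, sentence)
    pairs: each query keeps its aligned sentence as a positive (label 1)
    and draws sentences from other pairs as negatives (label 0)."""
    rng = random.Random(seed)
    sentences = [s for _, s in parallel_pairs]
    examples = []
    for query, pos in parallel_pairs:
        examples.append((query, pos, 1))  # aligned sentence: relevant
        candidates = [s for s in sentences if s != pos]
        for neg in rng.sample(candidates, min(num_negatives, len(candidates))):
            examples.append((query, neg, 0))  # sampled sentence: irrelevant

    return examples

# Toy English-Somali pairs (invented for illustration only).
pairs = [("water shortage", "biyo yaraan"), ("school opens", "dugsi furan")]
data = build_training_examples(pairs, num_negatives=1)
```

In the paper's setting the resulting labeled examples would feed a cross-lingual embedding-based relevance classifier; here they are just returned as triples.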