Document alignment techniques based on multilingual sentence representations have recently shown state-of-the-art results. However, these techniques rely on unsupervised distance measures, which cannot be fine-tuned to the task at hand. In this paper, instead of these unsupervised distance measures, we employ Metric Learning to derive task-specific distance measures. These measures are supervised, meaning that the distance metric is trained using a parallel dataset. Using a dataset covering English, Sinhala, and Tamil, which belong to three different language families, we show that these task-specific supervised distance metrics outperform their unsupervised counterparts for document alignment.
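To make the idea of a supervised, task-specific distance concrete, the following is a minimal sketch and not the method used in the paper. The abstract does not name the metric learning algorithm, so the sketch assumes a linear (Mahalanobis-style) map `L` trained with a contrastive loss. The function names `learned_distance` and `train_metric`, the hyperparameters, and the two-dimensional toy embeddings are all hypothetical.

```python
import numpy as np


def learned_distance(L, x, y):
    """Distance ||L (x - y)||_2 under a linear map L (assumed metric form)."""
    return float(np.linalg.norm(L @ (x - y)))


def train_metric(pairs, labels, dim, margin=1.0, lr=0.05, epochs=200):
    """Fit L by per-example gradient descent on a contrastive loss.

    pairs  : list of (x, y) embedding pairs
    labels : 1 for a parallel (aligned) pair, 0 for a non-parallel pair
    """
    L = np.eye(dim)  # start from the plain Euclidean distance
    for _ in range(epochs):
        for (x, y), label in zip(pairs, labels):
            diff = x - y
            proj = L @ diff
            dist = np.linalg.norm(proj)
            if label == 1:
                # Parallel pair: loss = dist^2, which pulls the pair together.
                grad = 2.0 * np.outer(proj, diff)
            elif 0.0 < dist < margin:
                # Non-parallel pair inside the margin: loss = (margin - dist)^2,
                # which pushes the pair apart until it reaches the margin.
                grad = -2.0 * (margin - dist) * np.outer(proj / dist, diff)
            else:
                continue  # non-parallel pair already far enough apart
            L -= lr * grad
    return L


# Toy data (hypothetical): dimension 0 carries content, dimension 1 carries
# a nuisance offset that differs between the two sides of a parallel pair.
x_par, y_par = np.array([0.2, 0.0]), np.array([0.2, 1.0])  # parallel
x_non, y_non = np.array([0.2, 0.0]), np.array([0.7, 0.0])  # not parallel

L = train_metric([(x_par, y_par), (x_non, y_non)], [1, 0], dim=2)

print("Euclidean:", learned_distance(np.eye(2), x_par, y_par),
      learned_distance(np.eye(2), x_non, y_non))
print("Learned:  ", learned_distance(L, x_par, y_par),
      learned_distance(L, x_non, y_non))
```

In this toy setting the plain Euclidean distance ranks the non-parallel pair as closer than the parallel one, while the trained map shrinks the nuisance dimension and reverses that ranking. A real document alignment system would presumably train on sentence or document embeddings from a multilingual encoder and on many more pairs.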