Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
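To make the mark-then-translate idea concrete, here is a minimal sketch. It is not the released EasyProject code. The function names `insert_markers`, `extract_spans`, and `toy_translate` are hypothetical, the square-bracket marker format is an assumption, and the toy translator stands in for a real machine translation system. The sketch also assumes that the translation keeps every marker pair intact and in the original order, which real systems do not guarantee.

```python
import re
from typing import List, Tuple

# (start, end, label) with character offsets, end exclusive.
Span = Tuple[int, int, str]


def insert_markers(text: str, spans: List[Span]) -> str:
    """Wrap each labeled span in square brackets before translation.

    Assumes spans do not overlap and the text contains no brackets.
    """
    parts, prev = [], 0
    for start, end, _label in sorted(spans):
        parts.append(text[prev:start])
        parts.append("[" + text[start:end] + "]")
        prev = end
    parts.append(text[prev:])
    return "".join(parts)


def extract_spans(marked: str, labels: List[str]) -> Tuple[str, List[Span]]:
    """Strip markers from translated text and recover span offsets.

    Labels are assigned by marker order (a simplifying assumption).
    """
    clean_parts: List[str] = []
    spans: List[Span] = []
    cursor = 0   # read position in the marked string
    length = 0   # length of the clean string built so far
    for i, match in enumerate(re.finditer(r"\[(.*?)\]", marked)):
        before = marked[cursor:match.start()]
        clean_parts.append(before)
        length += len(before)
        inner = match.group(1)
        label = labels[i] if i < len(labels) else "UNK"
        spans.append((length, length + len(inner), label))
        clean_parts.append(inner)
        length += len(inner)
        cursor = match.end()
    clean_parts.append(marked[cursor:])
    return "".join(clean_parts), spans


def toy_translate(text: str) -> str:
    """Hypothetical stand-in for an MT system (English to German)."""
    return text.replace("visited", "besuchte")


if __name__ == "__main__":
    source = "Obama visited Paris"
    gold = [(0, 5, "PER"), (14, 19, "LOC")]
    marked = insert_markers(source, gold)
    translated = toy_translate(marked)
    target, projected = extract_spans(translated, [lab for _, _, lab in sorted(gold)])
    print(marked)
    print(target, projected)
```

Run as a script, this marks the source as `[Obama] visited [Paris]`, translates the marked sentence, and recovers the projected spans from the marker positions in the output. A word-alignment pipeline would instead translate the unmarked sentence and map the spans across with a separate aligner.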