We propose a novel multitask learning method for diacritization that trains a model to both diacritize and translate. Our method addresses data sparsity by exploiting large, readily available bitext corpora. Furthermore, translation requires implicit linguistic and semantic knowledge, which helps resolve ambiguities in the diacritization task. We apply our method to the Penn Arabic Treebank and report a new state-of-the-art word error rate of 4.79%. We also conduct manual and automatic analyses to better understand our method and to highlight some of the remaining challenges in diacritization.
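The multitask setup described above can be sketched as a weighted combination of the two per-task losses. This is a minimal illustration, not the paper's implementation; the mixing weight `alpha` is a hypothetical hyperparameter balancing diacritization against translation.

```python
def multitask_loss(diac_loss: float, trans_loss: float, alpha: float = 0.5) -> float:
    """Combine the diacritization and translation losses for joint training.

    alpha is a hypothetical mixing weight (not from the paper): alpha=1.0
    trains on diacritization only, alpha=0.0 on translation only.
    """
    return alpha * diac_loss + (1.0 - alpha) * trans_loss

# Example: with equal weighting, the joint loss is the mean of the two.
print(multitask_loss(2.0, 4.0))  # -> 3.0
```

In practice both tasks would share an encoder over the undiacritized Arabic input, with separate output heads; the translation head lets the model learn from large bitext corpora even though diacritized training data is scarce.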