Multilingual representations pre-trained on monolingual data exhibit considerably unequal task performance across languages. Previous studies address this challenge with resource-intensive contextualized alignment, which assumes the availability of large parallel data and thereby leaves under-represented language communities behind. In this work, we attribute the data hunger of previous alignment techniques to two limitations: (i) they cannot sufficiently leverage the available data and (ii) they are not trained properly. To address these issues, we introduce a supervised and an unsupervised density-based approach, named Real-NVP and GAN-Real-NVP respectively, both driven by Normalizing Flow. Both approaches decompose the alignment of multilingual subspaces into density matching and density modeling. We complement them with validation criteria that guide the training process. Our experiments cover 16 alignment methods, including ours, evaluated on 6 language pairs, synthetic data and 4 NLP tasks. We demonstrate the effectiveness of our approaches in scenarios with limited and with no parallel data. First, our supervised approach, trained on 20k parallel data, mostly surpasses Joint-Align and InfoXLM, which are trained on much larger parallel data. Second, parallel data can be removed without sacrificing performance when our unsupervised approach is integrated into our bootstrapping procedure, which is theoretically motivated to enforce equality of multilingual subspaces. Moreover, we demonstrate the advantages of validation criteria over validation data for guiding supervised training. Our code is available at https://github.com/AIPHES/Real-NVP.
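To make the density-modeling ingredient concrete, the following is a minimal sketch of a single affine coupling layer in the style of the general Real NVP technique, with a standard normal base density. It is not taken from the authors' repository: the function names (`make_layer`, `forward`, `inverse`, `log_density`), the linear scale and shift maps, and the `tanh` squashing are all illustrative assumptions, and the sketch covers neither the alignment objective nor the GAN component.

```python
import math
import random


def make_layer(dim, seed=0):
    """Random weights for the (hypothetical) linear scale and shift maps."""
    rng = random.Random(seed)
    half = dim // 2          # size of the part that passes through unchanged
    rest = dim - half        # size of the part that gets transformed

    def mat():
        return [[rng.gauss(0.0, 0.1) for _ in range(half)] for _ in range(rest)]

    return {"half": half, "W_s": mat(), "W_t": mat()}


def _scale_shift(layer, x1):
    # Log-scale s and shift t depend only on the untouched half x1.
    # tanh keeps exp(s) in a numerically safe range (an assumption of this sketch).
    s = [math.tanh(sum(w * v for w, v in zip(row, x1))) for row in layer["W_s"]]
    t = [sum(w * v for w, v in zip(row, x1)) for row in layer["W_t"]]
    return s, t


def forward(layer, x):
    """Map x to y and return the log-determinant of the Jacobian."""
    h = layer["half"]
    x1, x2 = x[:h], x[h:]
    s, t = _scale_shift(layer, x1)
    y2 = [v * math.exp(si) + ti for v, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)   # the Jacobian is triangular, so log|det| = sum(s)


def inverse(layer, y):
    """Exact inverse of forward; no matrix inversion is needed."""
    h = layer["half"]
    y1, y2 = y[:h], y[h:]
    s, t = _scale_shift(layer, y1)
    x2 = [(v - ti) * math.exp(-si) for v, si, ti in zip(y2, s, t)]
    return y1 + x2


def log_density(layer, x):
    """Change of variables: log p(x) = log N(f(x); 0, I) + log|det J_f(x)|."""
    z, log_det = forward(layer, x)
    log_base = -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)
    return log_base + log_det


layer = make_layer(dim=6)
x = [0.5, -1.0, 0.3, 2.0, -0.7, 1.1]
y, log_det = forward(layer, x)
x_rec = inverse(layer, y)
```

Because the layer is invertible with a cheap log-determinant, the exact log-density of an embedding can be evaluated and maximized. In practice several such layers would be stacked, with the pass-through half alternating between layers.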