Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from slow inference speed and heavy computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.
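As a usage illustration, below is a minimal sketch of scoring cross-lingual sentence pairs with a released LEALLA encoder from TensorFlow Hub. The exact model handle and the assumption that the encoder accepts raw string tensors directly should be verified against the published model pages.

```python
# A minimal sketch of computing cross-lingual sentence similarities with a
# released LEALLA encoder. The TF Hub handle below is an assumption; check
# the published LEALLA model pages for the exact handles and sizes.
import tensorflow as tf
import tensorflow_hub as hub

encoder = hub.KerasLayer("https://tfhub.dev/google/LEALLA/LEALLA-small/1")

english = tf.constant(["dog", "Puppies are nice."])
italian = tf.constant(["cane", "I cuccioli sono carini."])

# The encoder maps raw strings to low-dimensional, language-agnostic vectors.
en_emb = encoder(english)
it_emb = encoder(italian)

# Cosine similarity of L2-normalized embeddings; translation pairs should
# score highest along the diagonal.
en_norm = tf.nn.l2_normalize(en_emb, axis=1)
it_norm = tf.nn.l2_normalize(it_emb, axis=1)
print(tf.matmul(en_norm, it_norm, transpose_b=True))
```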