We consider zero-shot cross-lingual transfer in legal topic classification using the recent MultiEURLEX dataset. Since the original dataset contains parallel documents, which is unrealistic for zero-shot cross-lingual transfer, we develop a new version of the dataset without parallel documents. We use it to show that translation-based methods vastly outperform cross-lingual fine-tuning of multilingually pre-trained models, the best previous zero-shot transfer method for MultiEURLEX. We also develop a bilingual teacher-student zero-shot transfer approach, which exploits additional unlabeled target-language documents and performs better than a model fine-tuned directly on labeled target-language documents.
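The teacher-student idea above can be illustrated with a minimal sketch. This is not the paper's actual pipeline (which fine-tunes pre-trained transformers); it is a hypothetical toy version using NumPy and linear classifiers, showing only the core mechanism: a teacher trained on (translated) source-language data produces soft labels on unlabeled target-language documents, and a student is trained to match those soft distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: the teacher stands in for a classifier already
# trained on translated source-language documents. Here it is just a fixed
# linear model over random document features.
n_features, n_classes = 20, 3
W_teacher = rng.normal(size=(n_features, n_classes))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Unlabeled target-language documents (random features for this sketch).
X_unlabeled = rng.normal(size=(500, n_features))

# Step 1: the teacher produces soft labels on the unlabeled target documents.
soft_labels = softmax(X_unlabeled @ W_teacher)

# Step 2: train a student from scratch by minimizing cross-entropy between
# the student's predicted distribution and the teacher's soft labels.
W_student = np.zeros((n_features, n_classes))
lr = 0.1
for _ in range(300):
    probs = softmax(X_unlabeled @ W_student)
    grad = X_unlabeled.T @ (probs - soft_labels) / len(X_unlabeled)
    W_student -= lr * grad

# Sanity check: the student should largely agree with the teacher's
# hard predictions on fresh, held-out documents.
X_test = rng.normal(size=(200, n_features))
agreement = np.mean(
    softmax(X_test @ W_student).argmax(1) == softmax(X_test @ W_teacher).argmax(1)
)
print(f"student/teacher agreement: {agreement:.2f}")
```

In the paper's setting the student sees only target-language text, so knowledge transfers across languages without any labeled target-language documents; the toy above omits the bilingual aspect and keeps only the distillation step.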