We present a new cross-lingual information retrieval (CLIR) model trained using multi-stage knowledge distillation (KD). The teacher and the student are heterogeneous systems: the former is a pipeline that relies on machine translation and monolingual IR, while the latter executes a single CLIR operation. We show that the student can learn both multilingual representations and CLIR by optimizing two corresponding KD objectives. Learning multilingual representations from an English-only retriever is accomplished using a novel cross-lingual alignment algorithm that greedily re-positions the teacher tokens for alignment. Evaluation on the XOR-TyDi benchmark shows that the proposed model is far more effective than the existing approach of fine-tuning with cross-lingual labeled IR data, with a 25.4-point gain in Recall@5kt.
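To make the greedy re-positioning idea concrete, here is a minimal sketch of one way such an alignment-based KD objective could be computed. All names and specifics (the similarity measure, the MSE loss, the function names) are illustrative assumptions, not the paper's actual algorithm: each student token greedily claims the most similar unused teacher token, and the distillation loss is taken over the aligned pairs.

```python
import numpy as np

def greedy_align(teacher: np.ndarray, student: np.ndarray) -> np.ndarray:
    """Greedily re-position teacher token vectors to match student tokens.

    teacher: (T, d) token representations from the teacher encoder.
    student: (S, d) token representations from the student encoder (S <= T).
    Returns teacher rows reordered so row i is the best remaining
    match (by dot-product similarity) for student token i.
    """
    sim = student @ teacher.T          # (S, T) pairwise similarities
    used: set[int] = set()
    order: list[int] = []
    for i in range(student.shape[0]):
        # pick the highest-similarity teacher token not yet claimed
        j = max((k for k in range(teacher.shape[0]) if k not in used),
                key=lambda k: sim[i, k])
        used.add(j)
        order.append(j)
    return teacher[order]

def alignment_kd_loss(teacher: np.ndarray, student: np.ndarray) -> float:
    """MSE between student tokens and their greedily aligned teacher tokens."""
    aligned = greedy_align(teacher, student)
    return float(np.mean((aligned - student) ** 2))
```

In practice the loss would be backpropagated into the student only, with the teacher frozen; the greedy matching here simply stands in for whatever alignment criterion the full algorithm uses.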