We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual representations as well as CLIR by optimizing two corresponding KD objectives. Learning useful representations of non-English text from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR demonstrates accuracy far superior to direct fine-tuning with labeled CLIR data. It is also the best single-model retriever on the XOR-TyDi benchmark at the time of this writing.
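To make the cross-lingual token alignment objective concrete, the following PyTorch snippet is a minimal sketch, not the paper's implementation. It assumes ColBERT-style per-token embeddings from the teacher and student encoders and a MaxSim-style alignment loss; the function name, shapes, and the cosine-distance formulation are our illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def token_alignment_kd_loss(teacher_tok: torch.Tensor,
                            student_tok: torch.Tensor) -> torch.Tensor:
    """Hypothetical cross-lingual token-alignment KD loss (a sketch).

    teacher_tok: (n_t, d) token embeddings of the English query from the
                 frozen English-only teacher retriever.
    student_tok: (n_s, d) token embeddings of the non-English query from
                 the multilingual student.

    Each teacher token is aligned to its most similar student token
    (MaxSim-style), and the loss pulls the aligned pairs together.
    """
    t = F.normalize(teacher_tok, dim=-1)
    s = F.normalize(student_tok, dim=-1)
    sim = t @ s.T                     # (n_t, n_s) cosine similarity matrix
    best = sim.max(dim=1).values      # best-aligned student token per teacher token
    return (1.0 - best).mean()        # 0 when every teacher token is matched exactly

# Toy usage with random embeddings; in training, gradients would flow
# only through the student encoder.
teacher = torch.randn(12, 128)
student = torch.randn(15, 128)
print(token_alignment_kd_loss(teacher, student).item())
```

In training, this alignment term would be combined with a second KD objective that distills the teacher's retrieval scores, matching the two-objective setup the abstract describes.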