Interactive and non-interactive models are the two de facto standard frameworks in vector-based cross-lingual information retrieval (V-CLIR), which embed queries and documents in a synchronous and an asynchronous fashion, respectively. From the perspectives of retrieval accuracy and computational efficiency, each model has its own strengths and shortcomings. In this paper, we propose a novel framework that leverages the advantages of both paradigms. Concretely, we introduce a semi-interactive mechanism, which builds our model on a non-interactive architecture but encodes each document together with its associated multilingual queries. In this way, cross-lingual features can be learned as effectively as in an interactive model. Furthermore, we transfer knowledge from a well-trained interactive model to ours by reusing its word embeddings and applying knowledge distillation. Our model is initialized from the multilingual pre-trained language model M-BERT and evaluated on two publicly available CLIR datasets derived from Wikipedia, as well as an in-house dataset collected from a real-world search engine. Extensive analyses show that our method significantly improves retrieval accuracy while maintaining computational efficiency.
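To make the semi-interactive encoding and the distillation objective concrete, below is a minimal PyTorch sketch written for this summary, not taken from the paper's code. The encoders are stand-in linear layers rather than M-BERT, and all names (`encode_document`, `teacher_scorer`, the tensor shapes) are illustrative assumptions. It shows a non-interactive student whose document representation is pooled together with its associated multilingual queries, then distilled from an interactive (cross-encoder) teacher.

```python
# Minimal sketch (not the authors' released code) of the semi-interactive idea:
# a bi-encoder scores query and document vectors independently, but the
# document side is encoded together with its associated multilingual queries,
# and the student is distilled from an interactive cross-encoder teacher.
# All module and variable names here are illustrative assumptions.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 768  # hidden size, matching an M-BERT-style encoder

# Stand-ins for the encoders (trainable student, frozen interactive teacher).
student_encoder = torch.nn.Linear(dim, dim)    # non-interactive student
teacher_scorer = torch.nn.Linear(2 * dim, 1)   # interactive teacher

def encode_document(doc_emb, query_embs):
    # Semi-interactive encoding: the document is pooled together with its
    # associated multilingual queries, so cross-lingual features are mixed
    # into the document representation at indexing time.
    joint = torch.cat([doc_emb.unsqueeze(0), query_embs], dim=0)
    return student_encoder(joint.mean(dim=0))

# Toy batch: one query embedding and 4 candidate documents,
# each paired with 3 multilingual auxiliary queries.
query = torch.randn(dim)
docs = torch.randn(4, dim)
aux_queries = torch.randn(4, 3, dim)

# Student scores: dot product of independently encoded query and documents.
doc_vecs = torch.stack(
    [encode_document(d, q) for d, q in zip(docs, aux_queries)])
student_scores = doc_vecs @ student_encoder(query)

# Teacher scores: the interactive model sees query and document jointly.
with torch.no_grad():
    teacher_scores = teacher_scorer(
        torch.cat([query.expand(4, -1), docs], dim=-1)).squeeze(-1)

# Knowledge distillation: match the student's score distribution over
# candidates to the teacher's.
kd_loss = F.kl_div(F.log_softmax(student_scores, dim=-1),
                   F.softmax(teacher_scores, dim=-1),
                   reduction="batchmean")
kd_loss.backward()
print(f"distillation loss: {kd_loss.item():.4f}")
```

Note the efficiency argument this sketch illustrates: the document-side encoding, including the pooling with multilingual queries, can run offline at indexing time, so the extra cross-lingual interaction adds no cost at query time, which is how the method retains non-interactive efficiency.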