Transformer-based models have achieved great success on sentence pair modeling tasks, such as answer selection and natural language inference (NLI). These models generally perform cross-attention over input pairs, leading to prohibitive computational costs. Recent studies propose dual-encoder and late-interaction architectures for faster computation. However, the balance between the expressiveness of cross-attention and computational speedup still needs better coordination. To this end, this paper introduces MixEncoder, a novel paradigm for efficient sentence pair modeling. MixEncoder involves a lightweight cross-attention mechanism: it encodes the query only once while modeling the query-candidate interactions in parallel. Extensive experiments conducted on four tasks demonstrate that our MixEncoder can speed up sentence pairing by over 113x while achieving performance comparable to the more expensive cross-attention models.
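The encode-the-query-once, lightweight-interaction idea can be illustrated with a toy sketch. This is a minimal illustration, not the paper's actual architecture: random vectors stand in for pre-computed candidate encoder outputs, the "interaction" is a single softmax-attention step from the query over each candidate's context vectors, and all names (`lightweight_score`, `candidates`) are hypothetical.

```python
import math
import random

random.seed(0)
D = 8  # toy hidden size

def rand_vec():
    # Random stand-in for an encoder output vector.
    return [random.gauss(0.0, 1.0) for _ in range(D)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline: each candidate is pre-encoded into a few context vectors
# (in a late-interaction setup these would be cached ahead of time).
candidates = [[rand_vec() for _ in range(3)] for _ in range(5)]

# Online: encode the query ONCE, then reuse it for every candidate.
query = rand_vec()

def lightweight_score(q, ctx):
    """One attention step from the query over a candidate's context
    vectors, then a dot product with the attended summary."""
    logits = [dot(v, q) / math.sqrt(D) for v in ctx]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]  # softmax attention weights
    summary = [sum(w * v[i] for w, v in zip(weights, ctx)) for i in range(D)]
    return dot(summary, q)

# Every candidate is scored independently from the single query
# encoding, so the scoring loop is trivially parallelizable.
scores = [lightweight_score(query, c) for c in candidates]
best = scores.index(max(scores))
```

Because the expensive query encoding happens once and each candidate's context vectors are pre-computed, the per-candidate cost is only the cheap interaction step, which is the source of the speedup over full cross-attention.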