Large transformer models can substantially improve Answer Sentence Selection (AS2) tasks, but their high computational cost prevents their use in many real-world applications. In this paper, we explore the following research question: how can we make AS2 models more accurate without significantly increasing their model complexity? To address this question, we propose a Multiple Heads Student architecture (named CERBERUS), an efficient neural network designed to distill an ensemble of large transformers into a single smaller model. CERBERUS consists of two components: a stack of transformer layers used to encode inputs, and a set of ranking heads; unlike traditional distillation techniques, each head is trained by distilling a different large transformer architecture, in a way that preserves the diversity of the ensemble members. The resulting model captures the knowledge of heterogeneous transformer models using only a few extra parameters. We show the effectiveness of CERBERUS on three English datasets for AS2; our proposed approach outperforms all single-model distillations we consider, rivaling state-of-the-art large AS2 models that have 2.7x more parameters and run 2.5x slower. Code for our model is available at https://github.com/amazon-research/wqa-cerberus
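The following is a minimal sketch, not the authors' implementation, of the multi-head student idea described above: a shared transformer encoder feeding several lightweight ranking heads, where each head is distilled against a different teacher so the diversity of the ensemble is preserved. The encoder name, head count, and loss are illustrative assumptions; the actual model is in the linked repository.

```python
# Minimal sketch of a multi-head student for AS2 distillation (assumptions only).
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiHeadStudent(nn.Module):
    def __init__(self, encoder_name="distilroberta-base", num_heads=3):
        super().__init__()
        # Shared stack of transformer layers used to encode (question, candidate) pairs.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One small ranking head per teacher in the ensemble (few extra parameters).
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_heads)])

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                      # [CLS] representation
        scores = [head(cls).squeeze(-1) for head in self.heads]
        return torch.stack(scores, dim=-1)                     # (batch, num_heads)

def distillation_loss(student_scores, teacher_scores):
    # Each head is matched to its own teacher's relevance scores (MSE as a stand-in).
    return nn.functional.mse_loss(student_scores, teacher_scores)

# At inference, head scores can be combined (e.g., averaged) into one ranking score:
# final_score = model(input_ids, attention_mask).mean(dim=-1)
```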