Multi-head attention, a collection of attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than previous methods while offering precise control of the sparsity level.
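To make the idea concrete, below is a minimal sketch of the core mechanism described above: learnable per-head importance variables trained with SGD, combined with a hard top-k gate that keeps exactly the user-specified number of heads. This is not the paper's exact implementation; it assumes a straight-through estimator over a sigmoid relaxation, and all names (head_logits, hard_topk_gate, num_heads, k) are illustrative.

```python
# Sketch: learn per-head importance variables and keep exactly k heads.
# Forward pass applies a hard {0,1} mask; backward pass routes gradients
# through a soft relaxation (straight-through estimator).
import torch

num_heads = 12   # heads in one attention layer (illustrative)
k = 4            # user-specified number of heads to keep

# Learnable per-head importance variables, trained by SGD.
head_logits = torch.nn.Parameter(torch.zeros(num_heads))

def hard_topk_gate(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Return a {0,1} mask keeping the k largest logits, with a
    straight-through gradient to the soft scores."""
    soft = torch.sigmoid(logits)                       # soft relaxation in (0, 1)
    topk = torch.topk(logits, k).indices
    hard = torch.zeros_like(soft).scatter_(0, topk, 1.0)
    # Forward uses the hard mask; backward uses the soft scores.
    return hard + soft - soft.detach()

# Usage example: gate the per-head outputs of one attention layer.
optimizer = torch.optim.SGD([head_logits], lr=0.1)
head_outputs = torch.randn(8, num_heads, 64)           # (batch, heads, dim), dummy data
gate = hard_topk_gate(head_logits, k)                  # exactly k ones
pruned = head_outputs * gate.view(1, num_heads, 1)     # zero out pruned heads
loss = pruned.sum()                                    # stand-in for the task loss
loss.backward()
optimizer.step()
```

The hard constraint is satisfied by construction (the mask always has exactly k ones), while the straight-through trick keeps the importance variables trainable end-to-end, which is the sense in which the subset selection is differentiable.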