The transformer multi-head self-attention mechanism has been thoroughly investigated in recent years. On one hand, researchers are interested in understanding why and how transformers work. On the other hand, they propose new attention augmentation methods to make transformers more accurate, efficient, and interpretable. In this paper, we synergize these two lines of research in a human-in-the-loop pipeline that first identifies important task-specific attention patterns. Those patterns are then injected not only into the original model but also into smaller models, serving as a human-guided knowledge distillation process. The benefits of our pipeline are demonstrated in a case study on the extractive summarization task. After finding three meaningful attention patterns in the popular BERTSum model, experiments indicate that injecting these patterns improves the performance, and arguably the interpretability, of both the original and the smaller model.