Transformer models have achieved promising results on natural language processing (NLP) tasks, including extractive question answering (QA). Common Transformer encoders used in NLP tasks process the hidden states of all input tokens in the context paragraph throughout all layers. However, unlike in other tasks such as sequence classification, answering the posed question does not require all the tokens in the context paragraph. Following this motivation, we propose Block-Skim, which learns to skim unnecessary context in the higher hidden layers to improve and accelerate Transformer inference. The key idea of Block-Skim is to identify the context blocks that must be processed further and those that can be safely discarded early during inference. Crucially, we find that this information can be sufficiently derived from the self-attention weights inside the Transformer model itself. We then prune the hidden states at the unnecessary positions in the lower layers, achieving a significant inference-time speedup. To our surprise, we observe that models pruned in this way outperform their full-size counterparts. Block-Skim improves the accuracy of QA models on different datasets and achieves a 3x speedup on the BERT-base model.
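The following is a minimal sketch of the attention-derived block gating described above, assuming PyTorch, a fixed block size, and a hypothetical score threshold. The function names and the simple mean-over-heads heuristic are illustrative assumptions, not the paper's exact Block-Skim module.

```python
import torch

def block_skim_scores(attn_weights: torch.Tensor, block_size: int) -> torch.Tensor:
    """Aggregate one layer's self-attention weights into per-block importance.

    attn_weights: (batch, heads, seq_len, seq_len) attention probabilities.
    Returns: (batch, num_blocks) scores, one per contiguous block of tokens.
    """
    batch, _, seq_len, _ = attn_weights.shape
    num_blocks = seq_len // block_size
    # Attention mass received by each key token, averaged over heads.
    received = attn_weights.mean(dim=1).sum(dim=1)            # (batch, seq_len)
    received = received[:, : num_blocks * block_size]
    # Pool token-level attention mass into contiguous blocks.
    return received.view(batch, num_blocks, block_size).mean(dim=-1)

def prune_blocks(hidden: torch.Tensor, scores: torch.Tensor,
                 block_size: int, threshold: float = 0.5) -> torch.Tensor:
    """Keep only hidden states of high-scoring blocks (batch size 1 for simplicity)."""
    keep = scores > threshold                                  # (1, num_blocks)
    token_keep = keep.repeat_interleave(block_size, dim=1)     # (1, num_blocks * block_size)
    kept = hidden[:, : token_keep.shape[1]][token_keep]        # (num_kept, hidden_dim)
    return kept.unsqueeze(0)                                   # (1, num_kept, hidden_dim)
```

In the method as described, the skim decision is learned rather than taken from a fixed threshold; the heuristic above only illustrates that the self-attention weights carry the signal needed to decide which blocks to keep.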