A major challenge in deploying transformer models is their prohibitive inference cost, which scales quadratically with the input sequence length. This makes it especially difficult to use transformers for processing long sequences. To address this, we present a novel Learned Token Pruning (LTP) method that removes redundant tokens as the data passes through the different layers of the transformer. In particular, LTP prunes tokens whose attention score falls below a threshold value that is learned during training. Importantly, our threshold-based method avoids algorithmically expensive operations such as the top-k token selection used in prior token pruning methods, and it also leads to structured pruning. We extensively test the performance of our approach on multiple GLUE tasks and show that our learned threshold-based method consistently outperforms the prior state-of-the-art top-k based method, achieving up to ~2% higher accuracy with the same amount of FLOPs. Furthermore, our preliminary results show up to 1.4x and 1.9x throughput improvement on a Tesla T4 GPU and an Intel Haswell CPU, respectively, with less than 1% accuracy drop (and up to 2.1x FLOPs reduction). Our code is developed in PyTorch and has been open-sourced.
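To make the core idea concrete, the sketch below shows one simplified way threshold-based token pruning can be applied at a transformer layer: each token's importance is taken as the average attention it receives, and tokens whose importance falls below a learned threshold are dropped from subsequent computation. This is a minimal PyTorch illustration under our own assumptions; the function and argument names are hypothetical and it is not the paper's exact implementation.

```python
import torch


def threshold_token_pruning(hidden_states, attention_probs, threshold, keep_mask):
    """Minimal sketch of threshold-based token pruning at one layer.

    hidden_states:   (batch, seq_len, hidden)        token representations
    attention_probs: (batch, heads, seq_len, seq_len) softmax attention of the layer
    threshold:       scalar tensor, learned during training
    keep_mask:       (batch, seq_len) 1.0 for tokens still kept, 0.0 for pruned
    """
    # Importance of each token: attention it receives, averaged over
    # heads (dim=1) and over query positions (dim=1 after the first mean).
    importance = attention_probs.mean(dim=1).mean(dim=1)  # (batch, seq_len)

    # A token survives only if it was kept so far AND its importance
    # exceeds the learned threshold (hard comparison, used at inference).
    new_mask = keep_mask * (importance > threshold).float()

    # Here pruned tokens are simply zeroed out for clarity; an actual
    # implementation would remove them to realize the FLOPs savings.
    pruned_states = hidden_states * new_mask.unsqueeze(-1)
    return pruned_states, new_mask
```

Note that the hard comparison above is not differentiable, so learning the threshold during training requires a soft relaxation of this step, which the sketch omits for brevity.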