Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method that adaptively removes unimportant tokens as an input sequence passes through transformer layers. In particular, LTP prunes tokens whose attention score falls below a threshold value that is learned for each layer during training. Our threshold-based method allows the length of the pruned sequence to vary adaptively with the input, and avoids algorithmically expensive operations such as top-k token selection. We extensively evaluate LTP on GLUE tasks and show that our method outperforms prior state-of-the-art token pruning methods by up to ~2.5% higher accuracy at the same FLOPs. In particular, LTP achieves up to 2.1x FLOPs reduction with less than 1% accuracy drop, which results in up to 1.9x and 2.0x throughput improvement on Intel Haswell CPUs and NVIDIA V100 GPUs, respectively. Furthermore, we demonstrate that LTP is more robust than prior methods to variation in input sequence length. Our code is developed in PyTorch and has been open-sourced.
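The core mechanism described above — scoring each token by the attention it receives and dropping tokens whose score falls below a per-layer learned threshold — can be illustrated with a minimal sketch. This is our own simplified rendering, not the paper's implementation: the function names (`token_importance`, `prune_tokens`) and the choice of averaging attention over heads and queries are assumptions for illustration.

```python
import numpy as np

def token_importance(attn_probs):
    """Importance of each token, taken here as the average attention it
    receives across all heads and all query positions.

    attn_probs: array of shape (num_heads, seq_len, seq_len),
                softmax attention probabilities for one layer.
    Returns an array of shape (seq_len,)."""
    return attn_probs.mean(axis=(0, 1))

def prune_tokens(hidden, attn_probs, threshold):
    """Threshold-based pruning for one layer: keep only tokens whose
    importance meets the layer's (learned) threshold. Note the number of
    surviving tokens varies with the input, unlike top-k selection.

    hidden: array of shape (seq_len, hidden_dim)
    threshold: scalar learned during training (fixed here for the sketch)."""
    scores = token_importance(attn_probs)
    keep = scores >= threshold
    return hidden[keep], keep
```

For example, with 4 tokens where the last token receives little attention (each attention row `[0.3, 0.3, 0.3, 0.1]`) and a threshold of 0.2, `prune_tokens` keeps the first three tokens and drops the fourth. During training, the paper learns one such threshold per layer so that deeper layers can prune progressively more tokens.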