The Transformer architecture has become the de facto model for many machine learning tasks, ranging from natural language processing to computer vision. As such, improving its computational efficiency becomes paramount. One major computational inefficiency of Transformer-based models is that they spend the identical amount of computation throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they suffer from the lack of effective and end-to-end optimization of the discrete skimming predictor. To address the above limitations, we propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key idea of Transkimmer is to add a parameterized predictor before each layer that learns to make the skimming decision. We also propose to adopt the reparameterization trick and add a skim loss for the end-to-end training of Transkimmer. Transkimmer achieves a 10.97x average speedup on the GLUE benchmark compared with the vanilla BERT-base baseline, with less than 1% accuracy degradation.
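To make the core idea concrete, below is a minimal PyTorch sketch of a per-layer skim predictor trained with the Gumbel-softmax reparameterization trick and a skim loss, as described above. The module and function names (`SkimPredictor`, `skim_loss`) and the two-layer MLP predictor are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkimPredictor(nn.Module):
    """Per-layer predictor that decides, for each token, whether to keep or skim it."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 2),  # logits for [skim, keep]
        )

    def forward(self, hidden_states: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        logits = self.classifier(hidden_states)
        # Gumbel-softmax (reparameterization trick) yields a near-discrete,
        # differentiable keep/skim decision during end-to-end training.
        decision = F.gumbel_softmax(logits, tau=tau, hard=True)
        keep_mask = decision[..., 1]  # 1.0 = keep the token, 0.0 = skim it
        return keep_mask  # shape: (batch, seq_len)


def skim_loss(keep_masks: list[torch.Tensor]) -> torch.Tensor:
    """Encourage skimming by penalizing the fraction of kept tokens, averaged over layers."""
    return torch.mean(torch.stack([m.mean() for m in keep_masks]))
```

In a full model, one such predictor would sit before each Transformer layer: tokens with a keep decision are processed by the layer, while skimmed tokens bypass the remaining layers and are forwarded directly to the final output. The skim loss is added to the task loss so that the predictors learn to drop tokens without hurting accuracy; the exact predictor architecture and loss weighting here are assumptions for illustration.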