Large-scale pre-trained language models have shown remarkable results in diverse NLP applications. Unfortunately, these performance gains have been accompanied by a significant increase in computation time and model size, stressing the need for new or complementary strategies to improve the efficiency of these models. In this paper, we propose DACT-BERT, a differentiable adaptive computation time strategy for BERT-like models. DACT-BERT adds an adaptive computational mechanism to BERT's regular processing pipeline, which controls the number of Transformer blocks that need to be executed at inference time. By doing this, the model learns to combine the most appropriate intermediate representations for the task at hand. Our experiments demonstrate that, compared to the baselines, our approach excels in reduced computational regimes and is competitive in less restrictive ones.
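The adaptive mechanism described above can be illustrated with a minimal sketch: each Transformer block emits an output and a halting score, the outputs are combined as a running convex mixture, and computation stops once the remaining blocks can no longer meaningfully change the accumulated answer. This is a simplified, hypothetical rendering (scalar outputs, a made-up threshold rule), not the paper's actual implementation.

```python
# Sketch of a DACT-style accumulator over per-block outputs.
# Assumptions (not from the paper): scalar block outputs, a simple
# remaining-mass threshold as the early-exit criterion.

def dact_combine(block_outputs, halting_probs, threshold=0.05):
    """Combine intermediate representations with halting scores h_n via
    a_n = h_n * y_n + (1 - h_n) * a_{n-1}; exit early when the product
    of (1 - h_i) -- the influence left for later blocks -- is tiny."""
    accumulated = 0.0
    remainder = 1.0  # how much later blocks can still move the answer
    steps = 0
    for y, h in zip(block_outputs, halting_probs):
        accumulated = h * y + (1.0 - h) * accumulated  # convex update
        remainder *= (1.0 - h)
        steps += 1
        if remainder < threshold:  # later blocks barely matter; stop
            break
    return accumulated, steps

# With confident halting scores, only 2 of 4 blocks are executed:
acc, steps = dact_combine([1.0, 2.0, 3.0, 4.0], [0.9, 0.9, 0.9, 0.9])
```

At inference time this skips the remaining blocks outright, which is where the compute savings come from; at training time the full mixture stays differentiable, so the halting scores can be learned end-to-end.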