Existing pre-trained language models (PLMs) are often computationally expensive at inference time, making them impractical in various resource-limited real-world applications. To address this issue, we propose TR-BERT, a dynamic token reduction approach that accelerates PLM inference by flexibly adapting the number of layers each token passes through, avoiding redundant computation. Specifically, TR-BERT formulates token reduction as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning. Experimental results on several downstream NLP tasks show that TR-BERT can speed up BERT by 2-5 times to satisfy various performance demands. Moreover, TR-BERT achieves better performance with less computation on a suite of long-text tasks, since its token-level layer-number adaptation greatly accelerates the self-attention operation in PLMs. The source code and experiment details are available at https://github.com/thunlp/TR-BERT.
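To make the idea of token-level layer-number adaptation concrete, the following is a minimal sketch of an encoder that drops part of the token sequence between layers, so deeper layers attend over fewer tokens. The `TokenReducingEncoder` class, its linear scorers, the `reduce_at` layer indices, and the fixed `keep_ratio` are all hypothetical stand-ins for illustration; TR-BERT learns its multi-step selection strategy with reinforcement learning rather than the fixed top-k heuristic shown here.

```python
# Sketch: between selected encoder layers, keep only the top-k scored tokens,
# so later layers process a shorter sequence and self-attention cost shrinks.
import torch
import torch.nn as nn


class TokenReducingEncoder(nn.Module):
    def __init__(self, hidden_size=768, num_layers=12, num_heads=12,
                 reduce_at=(3, 6), keep_ratio=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_size, num_heads,
                                       dim_feedforward=4 * hidden_size,
                                       batch_first=True)
            for _ in range(num_layers)
        )
        # One scorer per reduction point; a stand-in for a learned selection policy.
        self.scorers = nn.ModuleDict(
            {str(i): nn.Linear(hidden_size, 1) for i in reduce_at}
        )
        self.keep_ratio = keep_ratio

    def forward(self, hidden):                       # hidden: (batch, seq, hidden)
        for i, layer in enumerate(self.layers):
            hidden = layer(hidden)
            if str(i) in self.scorers:
                scores = self.scorers[str(i)](hidden).squeeze(-1)   # (batch, seq)
                k = max(1, int(hidden.size(1) * self.keep_ratio))
                # Keep the k highest-scoring tokens, preserving their original order.
                top = scores.topk(k, dim=1).indices.sort(dim=1).values
                hidden = hidden.gather(
                    1, top.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
        return hidden


if __name__ == "__main__":
    enc = TokenReducingEncoder(hidden_size=64, num_layers=4, num_heads=4,
                               reduce_at=(1, 2), keep_ratio=0.5)
    out = enc(torch.randn(2, 128, 64))
    print(out.shape)   # sequence reduced twice: 128 -> 64 -> 32
```

Because self-attention scales quadratically with sequence length, halving the sequence at intermediate layers cuts most of the remaining attention cost, which is why the gains reported above are largest on long-text tasks.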