Dynamic early exiting aims to accelerate the inference of pre-trained language models (PLMs) by emitting predictions in internal layers without passing through the entire model. In this paper, we empirically analyze the working mechanism of dynamic early exiting and find that it faces a performance bottleneck under high speed-up ratios. On one hand, the PLMs' representations in shallow layers lack high-level semantic information and thus are not sufficient for accurate predictions. On the other hand, the exiting decisions made by internal classifiers are unreliable, leading to wrongly emitted early predictions. We instead propose a new framework for accelerating the inference of PLMs, CascadeBERT, which dynamically selects proper-sized and complete models in a cascading manner, providing comprehensive representations for predictions. We further devise a difficulty-aware objective, encouraging the model to output the class probability that reflects the real difficulty of each instance for a more reliable cascading mechanism. Experimental results show that CascadeBERT can achieve an overall 15\% improvement under 4$\times$ speed-up compared with existing dynamic early exiting methods on six classification tasks, yielding more calibrated and accurate predictions.
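The cascading mechanism described above can be illustrated with a minimal sketch: run candidate models from smallest to largest and emit the first prediction whose class probability clears a confidence threshold. The `softmax` helper, the fixed `threshold`, and the toy models are illustrative assumptions; the paper's actual difficulty-aware objective calibrates these probabilities rather than using raw confidence.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cascade_predict(instance, models, threshold=0.9):
    """Query models from smallest to largest; emit the first prediction
    whose maximum class probability clears the confidence threshold.
    The final (largest) model's prediction is always accepted."""
    for i, model in enumerate(models):
        probs = softmax(model(instance))
        confidence = max(probs)
        if confidence >= threshold or i == len(models) - 1:
            return probs.index(confidence), confidence

# Hypothetical toy models standing in for a small and a large PLM:
small = lambda x: [3.0, 0.0] if x == "easy" else [0.1, 0.2]
large = lambda x: [0.0, 4.0]

print(cascade_predict("easy", [small, large]))  # small model is confident, exits early
print(cascade_predict("hard", [small, large]))  # falls back to the large model
```

Unlike internal-layer early exiting, each stage here is a complete model, so every emitted prediction is backed by a full forward pass of some cascade member.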