PALBERT: Teaching ALBERT to Ponder

Pre-trained models are now the default choice for a wide range of NLP tasks. Despite their state-of-the-art results, there is practical evidence that these models may need a different number of computing layers for different input sequences, since evaluating all layers leads to overconfidence in wrong predictions (a problem known as overthinking). This problem can potentially be solved by adaptive computation time approaches, which were first designed to improve inference speed. The recently proposed PonderNet may be a promising solution for performing an early exit, because it treats the exit layer's index as a latent variable. However, the originally proposed exit criterion relies on sampling from a trained posterior distribution on the probability of exiting from the $i$-th layer. This sampling introduces major variance in exit layer indices and significantly reduces the resulting model's performance. In this paper, we propose improving PonderNet with a novel deterministic Q-exit criterion and a revisited model architecture. We adapted the proposed mechanism to ALBERT and RoBERTa and compared it with recent methods for performing an early exit. We observed that the proposed changes significantly improve on the original PonderNet architecture and outperform PABEE on a wide range of GLUE tasks. In addition, we performed an in-depth ablation study of the proposed architecture to further understand Lambda layers and their performance.
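
The contrast between a sampled exit and a deterministic one can be illustrated with a minimal sketch. The abstract does not define the Q-exit rule, so the code below assumes one plausible reading: exit at the first layer where the cumulative exit probability reaches a threshold `q`. The function names, the example halting probabilities in `lambdas`, and the default `q=0.5` are all hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import random
from itertools import accumulate


def exit_distribution(lambdas):
    """Turn per-layer halting probabilities into a distribution over exit layers.

    Uses p_i = lambda_i * prod_{j<i} (1 - lambda_j); any leftover mass is
    assigned to the last layer so the probabilities sum to one.
    """
    probs = []
    still_running = 1.0
    for lam in lambdas:
        probs.append(still_running * lam)
        still_running *= 1.0 - lam
    probs[-1] += still_running
    return probs


def sampled_exit(lambdas, rng):
    """Stochastic criterion: flip a Bernoulli(lambda_i) coin at every layer."""
    for i, lam in enumerate(lambdas):
        if rng.random() < lam:
            return i
    return len(lambdas) - 1


def q_exit(lambdas, q=0.5):
    """Assumed deterministic criterion: exit once the cumulative exit
    probability reaches the threshold q (an illustrative reading of Q-exit)."""
    for i, cdf in enumerate(accumulate(exit_distribution(lambdas))):
        if cdf >= q:
            return i
    return len(lambdas) - 1


# Hypothetical halting probabilities for a 4-layer model on one input.
lambdas = [0.1, 0.2, 0.4, 0.6]
rng = random.Random(0)

sampled = [sampled_exit(lambdas, rng) for _ in range(1000)]
deterministic = [q_exit(lambdas) for _ in range(1000)]

print("distinct sampled exit layers:", sorted(set(sampled)))
print("distinct deterministic exit layers:", sorted(set(deterministic)))
```

For a fixed input, the sampled criterion spreads its exits across several layers, while the thresholded criterion always returns the same layer index. This is the variance in exit layer indices that the abstract attributes to sampling.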
