Transformer-based language models (LMs) create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in between. In this work, we suggest a simple method for such casting, using linear transformations. We show that our approach produces more accurate approximations than the prevailing practice of inspecting hidden representations from all layers in the space of the final layer. Moreover, in the context of language modeling, our method allows "peeking" into early-layer representations of GPT-2 and BERT, showing that LMs often already predict the final output in early layers. We then demonstrate the practicality of our method for recent early exit strategies, showing that when aiming, for example, at retention of 95% accuracy, our approach saves an additional 7.9% of layers for GPT-2 and 5.4% of layers for BERT, on top of the savings of the original approach. Last, we extend our method to linearly approximate sub-modules, finding that attention is the most tolerant of this change.
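To make the idea of casting hidden representations to the final-layer space concrete, the sketch below fits a linear map from layer-ℓ hidden states of GPT-2 to its final-layer hidden states and then decodes the mapped states with the LM head to "peek" at an early-layer prediction. This is a minimal illustration, not the paper's exact training procedure: the choice of layer 4, the toy fitting texts, and the plain least-squares fit are assumptions made for brevity.

```python
# Minimal sketch: fit a linear map from layer-l hidden states to final-layer
# hidden states of GPT-2, then decode the mapped states with the LM head.
# Layer index, fitting data, and the least-squares fit are illustrative choices.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

layer = 4  # early source layer (assumption for illustration)
texts = ["The capital of France is", "Transformers process text in"]

H_src, H_final = [], []
with torch.no_grad():
    for text in texts:
        ids = tok(text, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[-1] is the
        # final-layer representation (after the final layer norm).
        H_src.append(out.hidden_states[layer].squeeze(0))
        H_final.append(out.hidden_states[-1].squeeze(0))

X = torch.cat(H_src)    # (num_tokens, d_model) source-layer states
Y = torch.cat(H_final)  # (num_tokens, d_model) final-layer states

# Ordinary least squares: find A minimizing ||X A - Y||^2.
A = torch.linalg.lstsq(X, Y).solution  # (d_model, d_model)

# "Peek": map an early representation to final-layer space and decode it.
with torch.no_grad():
    ids = tok("The Eiffel Tower is located in", return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    h = out.hidden_states[layer][0, -1]   # last-token state at the source layer
    mapped = h @ A                        # approximate final-layer state
    logits = model.lm_head(mapped)        # decode with the unchanged LM head
    print(tok.decode([logits.argmax().item()]))
```

In this sketch the linear map replaces the transformer blocks between the source layer and the output, while the LM head itself is left untouched; the same recipe could be repeated for any source layer, or for BERT with its masked-LM head in place of the GPT-2 head.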