Language models pretrained on large textual data have been shown to encode different types of knowledge simultaneously. Traditionally, only the features from the last layer are used when adapting to new tasks or data. We put forward that, when using or finetuning deep pretrained models, intermediate-layer features that may be relevant to the downstream task are buried too deep to be used efficiently in terms of the samples or steps needed. To test this, we propose a new layer fusion method, Depth-Wise Attention (DWAtt), to help re-surface signals from non-final layers. We compare DWAtt to a basic concatenation-based layer fusion method (Concat), and compare both to a deeper model baseline -- all kept within a similar parameter budget. Our findings show that DWAtt and Concat are more step- and sample-efficient than the baseline, especially in the few-shot setting. DWAtt outperforms Concat at larger data sizes. On CoNLL-03 NER, layer fusion yields a 3.68-9.73% F1 gain across different few-shot sizes. The layer fusion models presented significantly outperform the baseline in various training scenarios with different data sizes, architectures, and training constraints.
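To make the idea concrete, below is a minimal PyTorch sketch of depth-wise layer fusion in the spirit described above: each token's last-layer state attends over that token's hidden states across all layers, and the fused result is combined with the last-layer output. The class name `DepthWiseAttentionFusion`, the projection sizes, and the residual combination are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class DepthWiseAttentionFusion(nn.Module):
    """Illustrative sketch (not the paper's exact method): fuse per-layer
    hidden states of each token by letting its last-layer representation
    attend over the stack of layer outputs at the same position."""

    def __init__(self, d_model: int, d_key: int = 64):
        super().__init__()
        self.query = nn.Linear(d_model, d_key)    # query from the last-layer state
        self.key = nn.Linear(d_model, d_key)      # key per layer
        self.value = nn.Linear(d_model, d_model)  # value per layer
        self.scale = d_key ** -0.5

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, d_model)
        last = layer_states[-1]                            # (batch, seq, d_model)
        q = self.query(last).unsqueeze(2)                  # (batch, seq, 1, d_key)
        k = self.key(layer_states).permute(1, 2, 0, 3)     # (batch, seq, layers, d_key)
        v = self.value(layer_states).permute(1, 2, 0, 3)   # (batch, seq, layers, d_model)
        attn = torch.softmax((q * k).sum(-1) * self.scale, dim=-1)  # (batch, seq, layers)
        fused = (attn.unsqueeze(-1) * v).sum(dim=2)        # (batch, seq, d_model)
        return last + fused                                # residual around the fusion
```

In practice, the per-layer states could be obtained by stacking the `hidden_states` tuple returned by a Transformer encoder run with `output_hidden_states=True`, then feeding the fused output to the downstream task head (e.g., a token classifier for NER); the Concat baseline would instead concatenate the per-layer states along the feature dimension and project them back to `d_model`.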