Prompt tuning is an emerging way of adapting pre-trained language models to downstream tasks. However, existing studies mainly add prompts to the input sequence, which may not work as expected: the prompts must pass through the intermediate multi-head self-attention and feed-forward network computations, making model optimization less smooth. Hence, we propose a novel tuning approach called layer tuning, which adds learnable parameters inside Transformer layers. Specifically, we focus on layer tuning for the feed-forward network in the Transformer, namely FL-tuning; see the sketch below for the basic idea. It introduces additional units into the hidden layer of each feed-forward network. We conduct extensive experiments on the public CLUE benchmark. The results show that: 1) Our FL-tuning outperforms prompt tuning methods under both full-data and few-shot settings in almost all cases. In particular, it improves accuracy by 17.93% (full-data setting) on WSC 1.0 and F1 by 16.142% (few-shot setting) on CLUENER over P-tuning v2. 2) Our FL-tuning is more stable and converges about 1.17 times faster than P-tuning v2. 3) With only about 3% of the Transformer's parameters to train, FL-tuning is comparable to fine-tuning on most datasets and significantly outperforms fine-tuning on several datasets (e.g., accuracy improved by 12.9% on WSC 1.1). The source code is available at https://github.com/genggui001/FL-Tuning.
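To make the idea concrete, the following is a minimal PyTorch sketch of adding trainable units to the hidden layer of a frozen feed-forward network; the module and parameter names (FLTunedFFN, n_extra, etc.) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FLTunedFFN(nn.Module):
    """Sketch of FL-tuning: extra trainable hidden units added to a frozen FFN.

    Widening the FFN hidden layer with new units is equivalent to adding a
    parallel low-width branch whose output is summed with the original FFN
    output. Names and defaults here are illustrative, not from the paper.
    """

    def __init__(self, pretrained_ffn: nn.Module, d_model: int, n_extra: int):
        super().__init__()
        self.ffn = pretrained_ffn  # original W1 -> activation -> W2, kept frozen
        for p in self.ffn.parameters():
            p.requires_grad = False
        # New hidden units: new columns of W1 (up-projection) and new rows of W2 (down-projection).
        self.extra_up = nn.Linear(d_model, n_extra)
        self.extra_down = nn.Linear(n_extra, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen pre-trained FFN output plus the contribution of the added units.
        return self.ffn(x) + self.extra_down(self.act(self.extra_up(x)))
```

In such a setup, only the added up/down projections are updated during training, which is what keeps the number of trainable parameters to a small fraction of the Transformer.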