Activation functions can have a significant impact on reducing the topological complexity of input data and therefore improve model performance. Selecting a suitable activation function is an essential step in neural model design. However, the choice of activation function is seldom discussed or explored in Transformer-based language models: their activation functions are chosen beforehand and then remain fixed from pre-training to fine-tuning. As a result, the inductive biases they impose on models cannot be adjusted during this long life cycle. Moreover, subsequently developed models (e.g., RoBERTa, BART, and GPT-3) often follow prior work (e.g., BERT) and use the same activation function without justification. In this paper, we investigate the effectiveness of using Rational Activation Functions (RAFs), a class of learnable activation functions, in the Transformer architecture. In contrast to conventional, predefined activation functions, RAFs can adaptively learn an optimal activation function during training from the input data. Our experiments show that the RAF-based Transformer (RAFT) achieves a lower validation perplexity than a vanilla BERT with the GELU function. We further evaluate RAFT on downstream tasks in low- and full-data settings. Our results show that RAFT outperforms the counterpart model across the majority of tasks and settings. For instance, RAFT outperforms vanilla BERT on the GLUE benchmark by 5.71 points on average in the low-data scenario (where 100 training examples are available) and by 2.05 points on SQuAD in the full-data setting. Analysis of the shapes of the learned RAFs further reveals that they vary substantially between different layers of the pre-trained model and mostly look very different from conventional activation functions. RAFT opens a new research direction for analyzing and interpreting pre-trained models according to the learned activation functions.
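For readers unfamiliar with rational activations: a RAF is typically parameterized as a ratio of two low-degree polynomials whose coefficients are trained jointly with the rest of the network. The sketch below illustrates this idea only; the polynomial degrees, the |.|-based "safe" denominator, the initialization, and the module name are assumptions for illustration and do not reproduce the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Illustrative sketch of a rational activation f(x) = P(x) / Q(x),
    where P and Q are low-degree polynomials with learnable coefficients.
    Degrees and parameterization here are assumptions, not the paper's exact setup."""

    def __init__(self, m: int = 5, n: int = 4):
        super().__init__()
        # Numerator coefficients a_0 .. a_m and denominator coefficients b_1 .. b_n
        self.a = nn.Parameter(0.1 * torch.randn(m + 1))
        self.b = nn.Parameter(0.1 * torch.randn(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # P(x) = sum_j a_j * x^j
        p = sum(self.a[j] * x ** j for j in range(self.a.numel()))
        # Q(x) = 1 + |sum_k b_k * x^k|  (keeps the denominator bounded away from zero)
        q = 1.0 + torch.abs(sum(self.b[k - 1] * x ** k for k in range(1, self.b.numel() + 1)))
        return p / q

# Usage: a drop-in replacement for a fixed activation (e.g., GELU) in a Transformer FFN block.
act = RationalActivation()
h = act(torch.randn(2, 8, 16))  # same shape in and out
```

Because the coefficients are ordinary parameters, each layer can learn a different activation shape during pre-training and continue adapting it during fine-tuning, which is the property the abstract refers to.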