Activation functions can have a significant impact on reducing the topological complexity of input data and therefore improve model performance. Selecting a suitable activation function is an essential step in neural model design. However, the choice of activation function is seldom discussed or explored in Transformer-based language models: their activation functions are chosen beforehand and then remain fixed from pre-training to fine-tuning. As a result, the inductive biases they impose on models cannot be adjusted during this long life cycle. Moreover, subsequently developed models (e.g., RoBERTa, BART, and GPT-3) often follow prior work (e.g., BERT) and use the same activation function without justification. In this paper, we investigate the effectiveness of using Rational Activation Functions (RAFs), a class of learnable activation functions, in the Transformer architecture. In contrast to conventional, predefined activation functions, RAFs can adaptively learn an optimal activation function during training from the input data. Our experiments show that the RAF-based Transformer (RAFT) achieves a lower validation perplexity than a vanilla BERT with the GELU function. We further evaluate RAFT on downstream tasks in low- and full-data settings. Our results show that RAFT outperforms the counterpart model across the majority of tasks and settings. For instance, RAFT outperforms vanilla BERT on the GLUE benchmark by 5.71 points on average in the low-data scenario (where 100 training examples are available) and by 2.05 points on SQuAD in the full-data setting. Analysis of the shapes of the learned RAFs further reveals that they vary substantially between different layers of the pre-trained model and mostly look very different from conventional activation functions. RAFT opens a new research direction for analyzing and interpreting pre-trained models according to the learned activation functions.
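For readers unfamiliar with rational activations: a RAF is typically parameterized as a ratio of two low-degree polynomials whose coefficients are trained jointly with the rest of the network. The sketch below illustrates this idea only; the polynomial degrees, the |.|-based "safe" denominator, the initialization, and the module name are assumptions for illustration and do not reproduce the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Illustrative sketch of a rational activation f(x) = P(x) / Q(x),
    where P and Q are low-degree polynomials with learnable coefficients.
    Degrees and parameterization here are assumptions, not the paper's exact setup."""

    def __init__(self, m: int = 5, n: int = 4):
        super().__init__()
        # Numerator coefficients a_0 .. a_m and denominator coefficients b_1 .. b_n
        self.a = nn.Parameter(0.1 * torch.randn(m + 1))
        self.b = nn.Parameter(0.1 * torch.randn(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # P(x) = sum_j a_j * x^j
        p = sum(self.a[j] * x ** j for j in range(self.a.numel()))
        # Q(x) = 1 + |sum_k b_k * x^k|  (keeps the denominator bounded away from zero)
        q = 1.0 + torch.abs(sum(self.b[k - 1] * x ** k for k in range(1, self.b.numel() + 1)))
        return p / q

# Usage: a drop-in replacement for a fixed activation (e.g., GELU) in a Transformer FFN block.
act = RationalActivation()
h = act(torch.randn(2, 8, 16))  # same shape in and out
```

Because the coefficients are ordinary parameters, each layer can learn a different activation shape during pre-training and continue adapting it during fine-tuning, which is the property the abstract refers to.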