We introduce stochastic activations, a novel strategy that randomly selects among several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU according to a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, its constant output for negative inputs, which blocks gradient flow. We leverage this strategy in two ways: (1) we use stochastic activations during pre-training and fine-tune the model with RELU, which is then used at inference time to produce sparse latent vectors. This reduces inference FLOPs and translates into a significant speedup on CPU and GPU, and it yields better results than training from scratch with the RELU activation. (2) We evaluate stochastic activations for sequence generation. This strategy performs reasonably well: it yields higher diversity at only slightly lower performance than the best deterministic non-linearity, SILU, combined with temperature sampling. It therefore offers an alternative way to increase the diversity of generated text.
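A minimal sketch of the core idea, assuming a PyTorch-style feed-forward block. The class name StochasticFFN, the dimensions, the probability p_relu, and the per-forward-pass draw granularity are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticFFN(nn.Module):
    """Feed-forward block that picks SILU or RELU via a Bernoulli draw.

    Hypothetical sketch: one draw per forward pass during training;
    a deterministic activation is used once training is done.
    """

    def __init__(self, d_model: int, d_hidden: int, p_relu: float = 0.5):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.p_relu = p_relu  # probability of drawing RELU for this pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc1(x)
        if self.training and torch.rand(()) < self.p_relu:
            h = F.relu(h)   # sparse branch: zero output and zero gradient for negatives
        else:
            h = F.silu(h)   # smooth branch: keeps gradients flowing for negative inputs
        return self.fc2(h)

# Usage sketch: stochastic choice while pre-training, deterministic activation afterwards.
ffn = StochasticFFN(d_model=512, d_hidden=2048)
x = torch.randn(4, 16, 512)
ffn.train(); y_train = ffn(x)  # random SILU/RELU draw per forward pass
ffn.eval();  y_infer = ffn(x)  # deterministic SILU here; the paper fine-tunes with RELU
                               # so inference can exploit sparse latent vectors
```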