We introduce stochastic activations, a novel strategy that randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, its constant output for negative inputs, which prevents gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to produce sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup on CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
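To make the core mechanism concrete, the following is a minimal sketch of a stochastic activation in PyTorch, assuming one Bernoulli draw per forward pass and a hypothetical mixing probability `p_silu`; the actual sampling granularity (per layer, per token, or per element) and the schedule used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def stochastic_activation(x: torch.Tensor, p_silu: float = 0.5, training: bool = True) -> torch.Tensor:
    """Sketch of a stochastic activation: apply SILU with probability p_silu,
    otherwise RELU (assumption: a single Bernoulli draw per forward pass).

    At fine-tuning/inference time (training=False), fall back to RELU so that
    negative pre-activations are zeroed out, yielding sparse latent vectors.
    """
    if training and torch.rand(()) < p_silu:
        return F.silu(x)   # smooth non-linearity, keeps gradients for negative inputs
    return F.relu(x)       # sparse non-linearity used at inference time
```

In this sketch, sampling SILU during pre-training keeps gradients flowing through negative inputs, while switching to RELU afterwards recovers the sparsity that reduces inference FLOPs.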