Activation functions play a pivotal role in determining the training dynamics and performance of neural networks. The widely adopted ReLU, despite being simple and effective, has a few disadvantages, including the dying ReLU problem. To tackle such problems, we propose a novel activation function called Serf, which is self-regularized and nonmonotonic in nature. Like Mish, Serf also belongs to the Swish family of functions. Based on several experiments on computer vision (image classification and object detection) and natural language processing (machine translation, sentiment classification and multimodal entailment) tasks with different state-of-the-art architectures, we observe that Serf vastly outperforms ReLU (the baseline) and other activation functions, including both Swish and Mish, with a markedly larger margin on deeper architectures. Ablation studies further demonstrate that Serf-based architectures perform better than their Swish and Mish counterparts, validating the effectiveness and compatibility of Serf across varying depths, complexities, optimizers, learning rates, batch sizes, initializers and dropout rates. Finally, we investigate the mathematical relation between Swish and Serf, showing that a preconditioner function ingrained in the first derivative of Serf provides a regularization effect that makes gradients smoother and optimization faster.
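As a concrete illustration of the Swish-family, self-gated activations discussed above, the sketch below implements Serf alongside Swish and Mish in PyTorch. The abstract does not state the functional form, so the definition used here, f(x) = x · erf(softplus(x)) (analogous to Mish's x · tanh(softplus(x))), is an assumption for illustration only, as are the function names.

```python
import torch
import torch.nn.functional as F


def serf(x: torch.Tensor) -> torch.Tensor:
    # Assumed form for illustration: x * erf(softplus(x)),
    # i.e. Mish with tanh replaced by the error function.
    return x * torch.erf(F.softplus(x))


def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # Swish / SiLU: x * sigmoid(beta * x).
    return x * torch.sigmoid(beta * x)


def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish: x * tanh(softplus(x)).
    return x * torch.tanh(F.softplus(x))


if __name__ == "__main__":
    # Evaluate Serf and its first derivative on a small grid;
    # the derivative is where the paper's preconditioner/regularization effect acts.
    x = torch.linspace(-5.0, 5.0, steps=11, requires_grad=True)
    y = serf(x)
    y.sum().backward()
    print(y.detach())   # activation values
    print(x.grad)       # dSerf/dx at the grid points
```

Like Swish and Mish, this form is smooth and nonmonotonic (it dips slightly below zero for small negative inputs before saturating), which is consistent with the properties the abstract attributes to Serf.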