As the problems optimized with deep learning become more practical, their datasets inevitably contain various kinds of noise, such as mislabeling and substitution by estimated inputs/outputs, which can negatively affect the optimization results. As a safety net, it is natural to improve the stochastic gradient descent (SGD) optimizer, which updates the network parameters as the final step of learning, so that it becomes more robust to noise. Previous work showed that the first momentum used in Adam-like SGD optimizers can be modified based on the noise-robust Student's t-distribution, thereby inheriting its robustness to noise. In this paper, we propose AdaTerm, which derives not only the first momentum but also all the other involved statistics from the Student's t-distribution. If the computed gradients appear to be aberrant, AdaTerm is expected to exclude them from the update and to reinforce the robustness for the subsequent updates; otherwise, it updates the network parameters normally and can relax the robustness for the subsequent updates. With this noise-adaptive behavior, the excellent learning performance of AdaTerm was confirmed on typical optimization problems under several settings with different noise ratios.
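To illustrate the noise-adaptive behavior described above, the following is a minimal sketch of a Student's-t-weighted moving average, not the exact AdaTerm update rule: a gradient that deviates strongly from the running statistics receives a small weight and therefore contributes little to the updated momentum, while ordinary gradients are incorporated almost as in a standard exponential moving average. The function name and the parameters `nu`, `beta`, and `eps` are illustrative choices, not taken from the paper.

```python
import numpy as np

def t_weighted_momentum(m, v, grad, nu=5.0, beta=0.9, eps=1e-8):
    """One illustrative step of a Student's-t-weighted moving average (sketch only).

    m, v : running estimates of the gradient mean and variance (1-D arrays)
    grad : newly computed (possibly noisy) gradient (1-D array)
    nu   : degrees of freedom; smaller values reject outliers more strongly
    beta : base decay of the moving average, as in Adam-like optimizers
    """
    d = grad.size
    # Scaled squared deviation of the new gradient from the running mean;
    # a large value suggests the gradient is probably aberrant.
    dev = np.sum((grad - m) ** 2 / (v + eps))
    # Student's-t weight: close to 1 for ordinary gradients, small for outliers.
    w = (nu + d) / (nu + dev)
    # Effective interpolation factor: aberrant gradients (small w) barely
    # change the statistics, ordinary gradients update them as usual.
    k = (1.0 - beta) * w
    m_new = (1.0 - k) * m + k * grad
    v_new = (1.0 - k) * v + k * (grad - m) ** 2
    return m_new, v_new
```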