On tasks such as CIFAR10, ImageNet, and VOC Segmentation, AdaX converges faster than SGD while matching its final performance, which is a very pleasant result. The paper contains more experiments as well as the corresponding theoretical analysis; we invite you to read it on arXiv and to try the code yourself. Turning Adam into AdaX requires only a very small change and is easy to implement. We look forward to seeing your results.
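To give a rough sense of how small the change is, here is a minimal sketch contrasting an Adam-style update step with an AdaX-style one. The AdaX second-moment rule and its bias correction follow my reading of the paper, and the defaults used here (for example beta2 = 1e-4) are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch: Adam-style step vs. AdaX-style step on a single parameter array.
# The AdaX update and its bias correction follow my reading of the paper;
# defaults such as beta2=1e-4 are assumptions, not the official implementation.
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2        # exponentially decaying memory
    m_hat = m / (1 - beta1 ** t)                   # Adam bias corrections
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

def adax_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=1e-4, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = (1 + beta2) * v + beta2 * grad ** 2        # exponential long-term memory of past gradients
    v_hat = v / ((1 + beta2) ** t - 1)             # matching correction: the weights sum to (1+beta2)^t - 1
    param = param - lr * m / (np.sqrt(v_hat) + eps)
    return param, m, v
```

In this sketch only the two lines that accumulate and normalize `v` differ between the two steps, which is why porting an existing Adam implementation over should only take a few lines.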