Data imbalance is a common problem in the machine learning literature that can have a critical effect on the performance of a model. Various solutions exist, such as those based on resampling or data generation, but their impact on the convergence of the gradient-based optimizers used in deep learning is not well understood. Here we elucidate the significant negative impact of data imbalance on learning, showing that the learning curves of the minority and majority classes follow sub-optimal trajectories when training with a gradient-based optimizer. This is not only because the gradient signal neglects the minority classes, but also because the minority classes are subject to larger directional noise, which slows their learning by an amount related to the imbalance ratio. To address this problem, we propose a new algorithmic solution and provide a detailed analysis of its convergence behavior. We show both theoretically and empirically that the new algorithm behaves better, with more stable learning curves for each class and better generalization performance.
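The central claim, that mini-batch gradients for minority classes are directionally noisier than those for majority classes, can be checked empirically. The sketch below is a minimal illustration on an assumed toy setup (a 2D Gaussian-mixture logistic-regression task with a 100:1 imbalance ratio); it is not the paper's algorithm, and all names such as `directional_noise` and the noise measure itself (one minus the mean cosine similarity to the mean gradient direction) are illustrative choices, not definitions from the paper.

```python
# Sketch (not the paper's method): compare the directional noise of
# per-class mini-batch gradients on a toy imbalanced task. The minority
# class contributes only ~1 example per uniform mini-batch, so its
# average gradient direction fluctuates far more across batches.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs with a 100:1 imbalance ratio (assumption).
n_major, n_minor = 10_000, 100
X = np.vstack([rng.normal(-1.0, 1.0, size=(n_major, 2)),
               rng.normal(+1.0, 1.0, size=(n_minor, 2))])
y = np.concatenate([np.zeros(n_major), np.ones(n_minor)])

w = np.zeros(2)  # fixed point at which we probe the gradient distribution

def per_example_grads(Xb, yb, w):
    """Logistic-loss gradient contribution of each example in the batch."""
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return (p - yb)[:, None] * Xb  # shape (batch, dim)

def directional_noise(grads):
    """1 - mean cosine similarity to the mean gradient direction:
    near 0 when batch gradients agree, near 1 when direction is noise."""
    mean_dir = grads.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir) + 1e-12
    norms = np.linalg.norm(grads, axis=1) + 1e-12
    cos = (grads @ mean_dir) / norms
    return 1.0 - cos.mean()

batch_grads = {0: [], 1: []}
for _ in range(500):
    idx = rng.choice(len(X), size=128, replace=False)  # uniform mini-batch
    Xb, yb = X[idx], y[idx]
    for c in (0, 1):
        mask = yb == c
        if mask.any():
            # average gradient this class contributes in this batch
            batch_grads[c].append(
                per_example_grads(Xb[mask], yb[mask], w).mean(axis=0))

for c, name in ((0, "majority"), (1, "minority")):
    noise = directional_noise(np.array(batch_grads[c]))
    print(f"class {c} ({name}): directional noise = {noise:.3f}")
```

Under this toy setup, the minority class shows a markedly higher directional-noise value than the majority class, consistent with the abstract's observation that this extra noise slows minority-class learning by an amount tied to the imbalance ratio.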