Real-world large-scale datasets are heteroskedastic and imbalanced -- labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning.
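The idea of regularizing higher-uncertainty, lower-density regions more heavily can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names `adaptive_reg_weights` and `regularized_loss`, the ratio form `uncertainty / density`, and the `base` and `power` parameters are hypothetical and are not the regularization strength derived in the paper. The per-example uncertainty and density estimates are taken as given inputs.

```python
import numpy as np


def adaptive_reg_weights(uncertainty, density, base=1.0, power=0.5, eps=1e-8):
    """Hypothetical per-example regularization strengths.

    The weight grows with the estimated label uncertainty and shrinks with
    the estimated local density. The ratio form and `power` are illustrative
    assumptions, not the formula from the paper.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    density = np.asarray(density, dtype=float)
    # eps guards against division by zero in very sparse regions.
    return base * (uncertainty / (density + eps)) ** power


def regularized_loss(per_example_loss, per_example_penalty, weights):
    """Mean over examples of loss_i + weight_i * penalty_i."""
    per_example_loss = np.asarray(per_example_loss, dtype=float)
    per_example_penalty = np.asarray(per_example_penalty, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(per_example_loss + weights * per_example_penalty))


# Three examples: clean and common, noisy and common, noisy and rare.
weights = adaptive_reg_weights(
    uncertainty=[0.1, 0.4, 0.4],
    density=[0.5, 0.5, 0.05],
)
```

Under these assumptions the noisy, rare example receives the largest weight and the clean, common example the smallest, which matches the qualitative behavior described above. The penalty term is left abstract; any per-example regularizer could be plugged in.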