With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
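The effective number and the resulting per-class weights can be sketched in a few lines. The formula $(1-\beta^{n})/(1-\beta)$ is from the text above; the class counts, function names, and the choice $\beta = 0.999$ are illustrative assumptions, not values taken from the paper.

```python
def effective_number(n, beta):
    """E_n = (1 - beta^n) / (1 - beta).

    Approaches n as beta -> 0 (every sample counts fully) and
    the asymptotic volume 1 / (1 - beta) as beta -> 1.
    """
    return (1.0 - beta ** n) / (1.0 - beta)


def class_balanced_weights(counts, beta=0.999):
    """Per-class loss weights proportional to 1 / E_n,
    normalized so they sum to the number of classes."""
    raw = [1.0 / effective_number(n, beta) for n in counts]
    total = sum(raw)
    return [w * len(counts) / total for w in raw]


# Hypothetical long-tailed class frequencies: head, mid, tail.
counts = [10000, 2000, 10]
weights = class_balanced_weights(counts)
# The tail class receives the largest weight, re-balancing the loss.
```

In a training loop, these weights would typically be passed to the loss function (e.g., as per-class weights in a weighted cross-entropy) so that under-represented classes contribute more per sample.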