Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets

Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner would harm the generalization performance. In this work, we theoretically show that out-of-distribution data can still be leveraged to augment the minority classes from a Bayesian perspective. Based on this motivation, we propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from our pre-defined distribution that is complementary to the distribution of original class priors. We empirically show that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.
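
The label-assignment step described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the complementary distribution, so the version below is an assumption: each class weight is taken as (largest prior + smallest prior) minus that class's own prior, then normalized, which reverses the ordering of the original priors. The function names `complementary_distribution` and `sample_open_set_labels` and the example class counts are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def complementary_distribution(class_counts):
    """Return a label distribution that favors rare classes.

    Assumed form (not from the paper): weight_j = (max prior + min prior) - prior_j,
    normalized to sum to 1, so the rarest class gets the largest probability.
    """
    total = sum(class_counts)
    priors = [count / total for count in class_counts]
    pivot = max(priors) + min(priors)
    weights = [pivot - prior for prior in priors]
    norm = sum(weights)
    return [weight / norm for weight in weights]


def sample_open_set_labels(num_open_set, class_counts, seed=0):
    """Draw one noisy label per open-set instance from the complementary distribution."""
    rng = random.Random(seed)
    dist = complementary_distribution(class_counts)
    classes = list(range(len(class_counts)))
    return rng.choices(classes, weights=dist, k=num_open_set)


# Hypothetical long-tailed training set with three classes.
class_counts = [5000, 500, 50]
dist = complementary_distribution(class_counts)
labels = sample_open_set_labels(10000, class_counts)
label_counts = Counter(labels)
print("complementary distribution:", dist)
print("sampled label counts:", dict(label_counts))
```

In this sketch the head class receives the fewest open-set labels and the tail class the most, so adding the labeled open-set instances to the training set pushes the combined class priors toward balance.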