In the real world, data tend to follow long-tailed distributions with respect to class or attribute, motivating the challenging Long-Tailed Recognition (LTR) problem. In this paper, we revisit recent LTR methods with the promising Vision Transformer (ViT). We observe that 1) ViT is hard to train on long-tailed data, and 2) ViT learns generalized features through unsupervised objectives such as masked generative training, on both long-tailed and balanced datasets. Hence, we propose to adopt unsupervised learning to exploit long-tailed data. Furthermore, we propose Predictive Distribution Calibration (PDC) as a novel metric for LTR, targeting the tendency of models to simply classify inputs into common classes. PDC quantitatively measures how well a model calibrates its predictive preference. On this basis, we find that many LTR approaches alleviate this bias only slightly, despite their accuracy improvements. Extensive experiments on benchmark datasets validate that PDC precisely reflects a model's predictive preference, consistent with the visualizations.
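To make "predictive preference" concrete, below is a minimal illustrative sketch (not the paper's actual PDC formula, which is defined in the main text): it compares the empirical distribution of a model's predicted labels on a balanced test set against a uniform reference, so a classifier biased toward head classes yields a larger divergence. The function name `predictive_preference_skew` and the use of KL divergence are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import entropy


def predictive_preference_skew(pred_labels, num_classes):
    """Illustrative only (not the paper's PDC definition): measure how far the
    empirical distribution of predicted labels on a balanced test set deviates
    from uniform. Larger values indicate a stronger bias toward a few classes."""
    counts = np.bincount(pred_labels, minlength=num_classes).astype(float)
    pred_dist = counts / counts.sum()                  # empirical prediction distribution
    uniform = np.full(num_classes, 1.0 / num_classes)  # balanced reference distribution
    return entropy(pred_dist, uniform)                 # KL(pred || uniform)


# Example: a head-biased classifier on a 10-class balanced test set
rng = np.random.default_rng(0)
biased_preds = rng.choice(
    10, size=1000,
    p=[0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01],
)
print(predictive_preference_skew(biased_preds, num_classes=10))  # > 0, head-biased
```

An unbiased model whose predictions spread evenly over classes would score near zero under this sketch, matching the intuition that PDC should reward calibrated predictive preferences.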