Deep learning-based models encounter challenges when processing long-tailed data in the real world. Existing solutions, which operate on the image modality alone, usually rely on balancing strategies or transfer learning to deal with the class imbalance problem. In this work, we present a visual-linguistic long-tailed recognition framework, termed VL-LTR, and conduct empirical studies on the benefits of introducing the text modality for long-tailed recognition (LTR). Compared to existing approaches, the proposed VL-LTR has the following merits. (1) Our method can not only learn visual representations from images but also learn the corresponding linguistic representations from noisy class-level text descriptions collected from the Internet; (2) Our method can effectively use the learned visual-linguistic representations to improve visual recognition performance, especially for classes with few image samples. We also conduct extensive experiments and set new state-of-the-art performance on widely used LTR benchmarks. Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which outperforms the previous best method by over 17 points and approaches the performance obtained by training on the full ImageNet. Code is available at https://github.com/ChangyaoTian/VL-LTR.
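To make the idea of pairing images with class-level text descriptions concrete, the following is a minimal, illustrative sketch of a CLIP-style symmetric contrastive loss between image embeddings and class-description embeddings. It is not the authors' implementation; the function name, tensor shapes, and the `temperature` value are hypothetical placeholders used purely for exposition.

```python
# Minimal sketch (assumption, not the official VL-LTR code): symmetric InfoNCE loss
# between image features and class-level text features, as used in CLIP-style
# visual-linguistic pretraining.
import torch
import torch.nn.functional as F


def contrastive_loss(image_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Compute a symmetric image-text contrastive loss.

    image_feats: (B, D) embeddings of B images.
    text_feats:  (B, D) embeddings of the matching class-level descriptions,
                 aligned row-by-row with image_feats.
    """
    # L2-normalize both modalities so the dot product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the matched image-text pairs.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

Because the text side is anchored at the class level rather than the instance level, tail classes with few images can still receive a stable linguistic prototype, which is the intuition behind merit (2) above.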