Vision-language models (VLMs) that use contrastive language-image pre-training have shown promising zero-shot classification performance. However, they perform relatively poorly on imbalanced datasets, where the class distribution of the training data is skewed and minority classes are consequently predicted poorly. For instance, CLIP achieves only 5% accuracy on the iNaturalist18 dataset. We propose to add a lightweight decoder to VLMs to avoid the out-of-memory (OOM) problem caused by the large number of classes and to capture nuanced features of the tail classes. We then explore improvements to VLMs via prompt tuning, fine-tuning, and the incorporation of imbalanced learning algorithms such as Focal Loss, Balanced Softmax, and Distribution Alignment. Experiments demonstrate that VLM performance can be further boosted when combined with the decoder and imbalanced learning methods. Specifically, our improved VLMs outperform zero-shot classification by 6.58%, 69.82%, and 6.17% average accuracy on ImageNet-LT, iNaturalist18, and Places-LT, respectively. We further analyze the influence of pre-training data size, backbone architecture, and training cost. Our study highlights the importance of imbalanced learning algorithms even for VLMs pre-trained on massive data. We release our code at https://github.com/Imbalance-VLM/Imbalance-VLM.
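For concreteness, below is a minimal PyTorch sketch of one of the imbalanced learning algorithms we incorporate, Balanced Softmax, which shifts each logit by the log of its class frequency so the loss compensates for label imbalance. The function name and tensor shapes here are illustrative only and do not reflect the exact implementation in our repository.

```python
import torch
import torch.nn.functional as F

def balanced_softmax_loss(logits: torch.Tensor,
                          labels: torch.Tensor,
                          class_counts: torch.Tensor) -> torch.Tensor:
    """Balanced Softmax loss (Ren et al., 2020).

    logits:       (batch, num_classes) raw classifier outputs
    labels:       (batch,) ground-truth class indices
    class_counts: (num_classes,) number of training samples per class
    """
    # Adding log(n_c) to class c's logit is equivalent to weighting the
    # softmax numerator/denominator by the class prior, which counteracts
    # the long-tailed label distribution during training.
    adjusted = logits + torch.log(class_counts.float().clamp(min=1))
    return F.cross_entropy(adjusted, labels)
```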