iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition

Image classification, which assigns images to pre-defined categories, has been the dominant approach to visual representation learning over the last decade. Visual learning through image-text alignment, however, has recently shown promising performance, especially for zero-shot recognition. We believe that these two learning tasks are complementary, and suggest combining them for better visual learning. Rather than shallow fusion through naive multi-task learning, we propose a deep fusion method with three adaptations that effectively bridge the two learning tasks. First, we replace the previous common practice in image classification, a linear classifier, with a cosine classifier, which shows comparable performance. Second, we convert the image classification problem from learning parametric category classifier weights to learning a text encoder that acts as a meta network to generate category classifier weights. The learnt text encoder is shared between image classification and image-text alignment. Third, we enrich each class name with a description to avoid confusion between classes and to bring the classification method closer to image-text alignment. We show that this deep fusion approach performs better than either individual learning or the shallow fusion approach on a variety of visual recognition tasks and setups, from zero-shot/few-shot image classification, such as the Kornblith 12-dataset benchmark, to the downstream tasks of action recognition, semantic segmentation, and object detection in fine-tuning and open-vocabulary settings. The code will be available at https://github.com/weiyx16/iCAR.
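
The three adaptations can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_text_encoder` is a hypothetical hashed bag-of-words stand-in for the learnt text encoder, and the embedding size, the temperature value, the class names, and the descriptions are all assumptions chosen for illustration. The sketch only shows the idea that classifier weights are generated from "class name plus description" text and compared with the image feature by temperature-scaled cosine similarity, so the same scoring function can serve both classification and image-text alignment.

```python
import math
import zlib

DIM = 32  # toy embedding size (assumption, not from the paper)


def toy_text_encoder(text, dim=DIM):
    """Hypothetical stand-in for a learnt text encoder (hashed bag-of-words)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def cosine_logits(image_feature, text_features, temperature=0.05):
    """Cosine classifier: temperature-scaled cosine similarity per text."""
    img = l2_normalize(image_feature)
    logits = []
    for feat in text_features:
        txt = l2_normalize(feat)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    return logits


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_weights(classes):
    """Generate classifier weights from (name, description) pairs."""
    return [toy_text_encoder(f"{name}, {desc}") for name, desc in classes]


# Illustrative classes; the descriptions disambiguate the shared name "crane".
classes = [
    ("crane", "a large long-necked wading bird"),
    ("crane", "a machine for lifting heavy loads"),
    ("night snake", "a small nocturnal reptile"),
]
weights = class_weights(classes)

# Stand-in image feature: here simply the embedding of the first class text,
# so that the example has a known nearest class.
image_feature = toy_text_encoder("crane, a large long-necked wading bird")
probs = softmax(cosine_logits(image_feature, weights))
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

For image-text alignment the same `cosine_logits` call would be applied with caption embeddings in place of `weights`, which is what allows one text encoder to be shared across the two tasks in this sketch.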
