CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels

Pre-trained vision-language models like CLIP have recently shown strong performance on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID), the labels are indexes and lack concrete text descriptions, so it remains to be determined how such models can be applied to these tasks. This paper first finds that simply fine-tuning the visual model initialized from the image encoder in CLIP already achieves competitive performance on various ReID tasks. We then propose a two-stage strategy to obtain a better visual representation. The key idea is to fully exploit the cross-modal description ability of CLIP through a set of learnable text tokens for each ID, which are fed to the text encoder to form ambiguous descriptions. In the first training stage, the image and text encoders from CLIP are kept fixed, and only the text tokens are optimized from scratch by the contrastive loss computed within a batch. In the second stage, the ID-specific text tokens and their encoder become static, providing constraints for fine-tuning the image encoder. With the help of the designed loss in the downstream task, the image encoder is able to accurately represent data as vectors in the feature embedding space. The effectiveness of the proposed strategy is validated on several datasets for person and vehicle ReID tasks. Code is available at https://github.com/Syliz517/CLIP-ReID.
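
The first-stage objective described above can be illustrated with a small sketch. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the abstract only says that the text tokens are optimized by a contrastive loss computed within a batch. The function name `batch_contrastive_loss`, the temperature value, the symmetric image-to-text plus text-to-image form, and the treatment of several images sharing one ID as joint positives are all assumptions made here. The feature arrays stand in for the outputs of the frozen encoders.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def batch_contrastive_loss(image_feats, text_feats, labels, temperature=0.07):
    """Hypothetical in-batch contrastive loss between image and text features.

    image_feats: (B, D) array, stand-in for frozen image-encoder outputs.
    text_feats:  (B, D) array, stand-in for the text-encoder output of the
                 learnable tokens belonging to each image's ID.
    labels:      (B,) integer ID of each image.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    # Assumption: every pair sharing an ID counts as a positive.
    pos = (labels[:, None] == labels[None, :]).astype(float)

    # Image-to-text: each image should pick out the text of its own ID.
    i2t = -(pos * log_softmax(logits, axis=1)).sum(axis=1) / pos.sum(axis=1)
    # Text-to-image: each text should pick out the images of its own ID.
    t2i = -(pos * log_softmax(logits, axis=0)).sum(axis=0) / pos.sum(axis=0)
    return float(i2t.mean() + t2i.mean())


# Toy batch: 4 IDs, 2 images per ID, orthogonal "text" prototypes.
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
text_table = np.eye(4, 8)
text_feats = text_table[labels]

# Image features that match their ID's text versus ones that match a wrong ID.
aligned = batch_contrastive_loss(text_table[labels], text_feats, labels)
misaligned = batch_contrastive_loss(text_table[(labels + 1) % 4], text_feats, labels)
print(aligned, misaligned)
```

In this toy batch the loss is lower when each image feature agrees with the text feature of its own ID. In the first stage, that is the signal that would drive the learnable tokens toward descriptions consistent with the frozen image features.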

