Although CLIP-like Visual Language Models provide a functional joint feature space for images and text, the limited image input size of CLIP-like models (e.g., 224) means that subtle details are lost from the feature representation when we feed in high-resolution images (e.g., 2240). In this work, we introduce an efficient framework that produces a single feature representation for a high-resolution image, injecting image details while sharing the same semantic space as the original CLIP. Within the framework, we train a feature fusion model on CLIP features extracted via a carefully designed image patching method that can cover objects of any scale, weakly supervised by image-agnostic class-prompted queries. We validate our framework by retrieving images from class-prompted queries on real-world and synthetic datasets, showing significant performance improvements on these tasks. Furthermore, to fully demonstrate our framework's detail-retrieval ability, we construct a CLEVR-like synthetic dataset named CLEVR-DS, which is fully annotated and has controllable object scale.
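The pipeline described above can be pictured as follows. This is a minimal sketch assuming the openai/CLIP package (`clip.load`, `encode_image`, `encode_text`); the `multi_scale_patches` grid pyramid and the attention-pooling `FusionHead` are hypothetical stand-ins, since the abstract does not specify the actual patching scheme or fusion architecture.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # 512-dim features

def multi_scale_patches(image, scales=(1, 2, 4)):
    """Cut the image into s x s grids at several scales so that objects
    of different sizes are covered by at least one patch (illustrative;
    not the paper's actual patching method)."""
    w, h = image.size
    patches = []
    for s in scales:
        pw, ph = w // s, h // s
        for i in range(s):
            for j in range(s):
                patches.append(image.crop((i * pw, j * ph, (i + 1) * pw, (j + 1) * ph)))
    return patches

@torch.no_grad()
def encode_patches(image):
    """Encode every patch with the frozen CLIP image encoder."""
    batch = torch.stack([preprocess(p) for p in multi_scale_patches(image)]).to(device)
    feats = model.encode_image(batch).float()
    return feats / feats.norm(dim=-1, keepdim=True)   # (num_patches, 512)

class FusionHead(torch.nn.Module):
    """Hypothetical fusion model: attention-pool the patch features into a
    single vector that stays comparable with CLIP text features."""
    def __init__(self, dim=512):
        super().__init__()
        self.query = torch.nn.Parameter(torch.randn(dim))

    def forward(self, patch_feats):                    # (num_patches, dim)
        weights = torch.softmax(patch_feats @ self.query, dim=0)
        fused = (weights[:, None] * patch_feats).sum(dim=0)
        return fused / fused.norm()                    # one CLIP-space vector

# Usage: score a high-resolution image against a class-prompted text query.
image_feat = FusionHead().to(device)(encode_patches(Image.open("scene.jpg")))
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a photo of a red cube"]).to(device)).float()
    text_feat = text_feat[0] / text_feat[0].norm()
score = image_feat @ text_feat
```

In the framework, a head like this would be trained with weak supervision from image-agnostic class-prompted text queries (e.g., a contrastive alignment objective), so the fused representation remains in CLIP's semantic space and can be used for retrieval against ordinary CLIP text embeddings.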