Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP's relative improvement over supervised ReC models trained on real images is 8%.
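To make the region-scoring idea concrete, the following is a minimal, illustrative Python sketch of isolating object proposals (by cropping, or by blurring everything outside the box) and scoring them against a referring expression with CLIP. It assumes OpenAI's open-source `clip` package and PIL; the function names, blur radius, and other details are hypothetical simplifications, and the sketch omits ReCLIP's ensembling of the two isolation methods as well as its spatial relation resolver.

```python
# Minimal sketch (not ReCLIP's exact implementation): score object proposals
# against a referring expression with CLIP after isolating each proposal.
import torch
import clip                                  # OpenAI's open-source CLIP package
from PIL import Image, ImageFilter

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def crop_proposal(image: Image.Image, box):
    """Cropping variant: keep only the proposal region, box = (x1, y1, x2, y2)."""
    return image.crop(box)

def blur_outside_proposal(image: Image.Image, box, radius=10):
    """Blurring variant: blur everything outside the proposal region.
    The blur radius here is an arbitrary illustrative choice."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    blurred.paste(image.crop(box), box)      # restore the sharp proposal region
    return blurred

def score_proposals(image, boxes, expression, isolate=crop_proposal):
    """Return one CLIP similarity score per proposal box for the expression."""
    text = clip.tokenize([expression]).to(device)
    regions = torch.stack([preprocess(isolate(image, b)) for b in boxes]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(regions)
        txt_feats = model.encode_text(text)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    return (img_feats @ txt_feats.T).squeeze(-1)   # cosine similarity per box

# Hypothetical usage: pick the proposal with the highest CLIP score.
# image = Image.open("scene.jpg").convert("RGB")
# boxes = [(10, 20, 120, 200), (150, 30, 300, 220)]
# scores = score_proposals(image, boxes, "the person on the left")
# best_box = boxes[int(scores.argmax())]
```

The design choice illustrated here is that both isolation variants reduce ReC to CLIP's native image-text matching: each proposal becomes a separate image whose similarity to the expression can be ranked directly, without any task-specific training.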