Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to the task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.