What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in the literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs at localizing generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications, including open-world object detection, salient and camouflaged object detection, and supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: \url{https://git.io/J1HPY}.