We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object discovery and segmentation, notably MONet and Slot Attention. While these algorithms learn object-centric representations solely by reconstructing the input image, LORL enables them to further associate the learned representations with concepts, i.e., words for object categories, properties, and spatial relationships, drawn from language input. These language-derived, object-centric concepts in turn facilitate the learning of object-centric representations. LORL can be integrated with various language-agnostic unsupervised object discovery algorithms. Experiments show that integrating LORL consistently improves the performance of unsupervised object discovery methods on two datasets with the help of language. We also show that the concepts learned by LORL, in conjunction with object discovery methods, aid downstream tasks such as referring expression comprehension.