Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept ``dog'' entails all images that contain dogs. Despite this intuition, current large-scale vision and language models such as CLIP do not explicitly capture such a hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties for embedding tree-like data, so MERU can better capture the underlying hierarchy in image-text data. Our results show that MERU learns a highly interpretable representation space while remaining competitive with CLIP on multi-modal tasks like image classification and image-text retrieval.
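To make the core idea concrete, below is a minimal sketch of contrastive scoring in hyperbolic space using the Lorentz (hyperboloid) model: Euclidean encoder outputs are lifted onto the hyperboloid via the exponential map at the origin, and negative geodesic distances replace CLIP's cosine similarities as contrastive logits. This is an illustrative sketch, not MERU's actual implementation; the function names, feature dimensions, and curvature value are assumptions made for the example.

\begin{verbatim}
import torch

def exp_map_origin(v: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Lift tangent vectors v of shape (B, d) at the hyperboloid
    origin onto the Lorentz model of curvature -c.
    Returns points of shape (B, d+1): [time, space] coordinates,
    satisfying <x, x>_L = -1/c."""
    sqrt_c = c ** 0.5
    v_norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    space = torch.sinh(sqrt_c * v_norm) * v / (sqrt_c * v_norm)
    time = torch.cosh(sqrt_c * v_norm) / sqrt_c
    return torch.cat([time, space], dim=-1)

def lorentz_distance(x: torch.Tensor, y: torch.Tensor,
                     c: float = 1.0) -> torch.Tensor:
    """Pairwise geodesic distances between hyperboloid points
    x (B, d+1) and y (B, d+1), via the Lorentzian inner product
    <x, y>_L = -x_0 * y_0 + sum_i x_i * y_i."""
    inner = -x[:, :1] @ y[:, :1].T + x[:, 1:] @ y[:, 1:].T
    # acosh needs its argument >= 1; clamp for numerical safety.
    return torch.acosh((-c * inner).clamp_min(1.0 + 1e-7)) / (c ** 0.5)

# Hypothetical encoder outputs (batch of 4 image-text pairs).
img_feat = torch.randn(4, 512)
txt_feat = torch.randn(4, 512)
img_hyp = exp_map_origin(img_feat)
txt_hyp = exp_map_origin(txt_feat)
# Negative distances act as the (4, 4) image-text logit matrix
# fed to a standard InfoNCE-style contrastive loss.
logits = -lorentz_distance(img_hyp, txt_hyp)
\end{verbatim}

One property worth noting in this geometry: distances grow roughly exponentially toward the boundary, so generic concepts can sit near the origin while their many specific instances spread out below them, which is what makes hyperbolic space a natural fit for the entailment hierarchy described above.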