Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize the visual content, they lack in expressing trivial abstract concepts, e.g. "people having a picnic". Such concepts are licensed by humans' personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14,997 images of the COCO dataset with 134,973 human-annotated (high-level) abstract captions collected along three axes: scenes, actions and rationales. We describe and release this dataset and show how it can be used to assess models' multimodal grounding of abstract concepts and to enrich models' visio-linguistic representations. Moreover, we describe potential tasks enabled by this dataset involving interactions between high- and low-level concepts.