The task of multimodal referring expression comprehension (REC), which aims to localize an image region described by a natural language expression, has recently received increasing attention within the research community. In this paper, we specifically focus on referring expression comprehension with commonsense knowledge (KB-Ref), a task which typically requires reasoning beyond spatial, visual or semantic information. We propose a novel framework for Commonsense Knowledge Enhanced Transformers (CK-Transformer) which effectively integrates commonsense knowledge into the representations of objects in an image, facilitating identification of the target objects referred to by the expressions. We conduct extensive experiments on several benchmarks for the task of KB-Ref. Our results show that the proposed CK-Transformer achieves a new state of the art, with an absolute accuracy improvement of 3.14% over the existing state of the art.