Daily images may convey abstract meanings that require us to memorize and infer profound information from them. To encourage such human-like reasoning, in this work we teach machines to predict where and when an image was taken, rather than performing basic tasks such as conventional segmentation or classification. Inspired by Horn's QR theory, we design a novel QR-CLIP model consisting of two components: 1) the Quantity module first retrieves broader open-world knowledge as candidate language inputs; 2) the Relevance module then carefully weighs the vision and language cues and infers the location and time. Experiments demonstrate QR-CLIP's effectiveness: it outperforms the previous SOTA on each task, with average relative lifts of about 10% on location reasoning and 130% on time reasoning. This study lays a technical foundation for location and time reasoning and suggests that effectively introducing open-world knowledge is one of the panaceas for these tasks.
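The two-module idea described above can be sketched at a very high level. Note this is a minimal illustrative toy, not the paper's implementation: the function names `quantity_module` and `relevance_module`, the use of cosine similarity for retrieval, and the softmax-weighted fusion are all assumptions made here for illustration.

```python
import numpy as np

def quantity_module(image_emb, knowledge_embs, k=3):
    # Quantity step (assumed): retrieve the k open-world knowledge
    # entries most similar to the image as candidate language inputs.
    sims = knowledge_embs @ image_emb
    top = np.argsort(-sims)[:k]
    return top, sims[top]

def relevance_module(image_emb, candidate_embs):
    # Relevance step (assumed): score each candidate against the
    # vision cue and fuse them with softmax weights.
    scores = candidate_embs @ image_emb
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = weights @ candidate_embs
    return fused, weights

# Toy example with unit-normalized random embeddings standing in
# for CLIP image/text features.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=4)
image_emb /= np.linalg.norm(image_emb)
knowledge = rng.normal(size=(10, 4))
knowledge /= np.linalg.norm(knowledge, axis=1, keepdims=True)

idx, _ = quantity_module(image_emb, knowledge, k=3)
fused, weights = relevance_module(image_emb, knowledge[idx])
```

In an actual system the fused representation would feed a classifier over location and time labels; here the sketch only shows the retrieval-then-weighting flow.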