Humans have a remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can't help but draw probable inferences beyond the literal scene based on our everyday experience and knowledge about the world. For example, if we see a "20 mph" sign alongside a road, we might assume the street sits in a residential area (rather than on a highway), even if no houses are pictured. Can machines perform similar visual reasoning? We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents. We adopt a free-viewing paradigm: participants first observe and identify salient clues within images (e.g., objects, actions) and then provide a plausible inference about the scene, given the clue. In total, we collect 363K (clue, inference) pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using our corpus, we test three complementary axes of abductive reasoning. We evaluate the capacity of models to: i) retrieve relevant inferences from a large candidate corpus; ii) localize evidence for inferences via bounding boxes; and iii) compare plausible inferences to match human judgments on a newly-collected diagnostic corpus of 19K Likert-scale judgments. While we find that fine-tuning CLIP-RN50x64 with a multitask objective outperforms strong baselines, significant headroom exists between model performance and human agreement. Data, models, and a leaderboard are available at http://visualabduction.com/
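To make the retrieval task concrete, below is a minimal zero-shot sketch using the off-the-shelf openai/CLIP package: given a clue region and a pool of candidate inferences, it ranks candidates by image-text cosine similarity. This is only an illustrative baseline, not the paper's fine-tuned multitask model; the function name `rank_inferences`, the example image path, box coordinates, and candidate strings are hypothetical, and cropping to the bounding box is just one simple way to condition on the clue region.

```python
# Minimal sketch: rank candidate inferences for a clue region with off-the-shelf CLIP.
# Zero-shot baseline only -- the paper fine-tunes CLIP-RN50x64 with a multitask objective.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x64", device=device)  # backbone named in the abstract

def rank_inferences(image_path, clue_box, candidates):
    """Score candidate inferences against a clue region given as (x0, y0, x1, y1)."""
    region = Image.open(image_path).crop(clue_box)  # crop = one simple conditioning choice
    image_input = preprocess(region).unsqueeze(0).to(device)
    text_input = clip.tokenize(candidates, truncate=True).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)
    # Normalize so the dot product is cosine similarity.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)
    return sorted(zip(candidates, sims.tolist()), key=lambda p: -p[1])

# Hypothetical example: a "20 mph" sign as the clue, two competing inferences.
print(rank_inferences("street.jpg", (40, 60, 180, 220),
                      ["this is a residential street", "this is a highway"]))
```

In the full benchmark the candidate pool is the entire corpus of inferences rather than two hand-picked strings, which is what makes retrieval a demanding test of abductive matching.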