A significant gap remains between today's visual pattern recognition models and human-level visual cognition, especially when it comes to few-shot learning and compositional reasoning about novel concepts. We introduce Bongard-HOI, a new visual reasoning benchmark that focuses on compositional learning of human-object interactions (HOIs) from natural images. It is inspired by two desirable characteristics of the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning. We carefully curate the few-shot instances with hard negatives, where positive and negative images differ only in their action labels, making mere recognition of object categories insufficient to solve our benchmark. We also design multiple test sets to systematically study the generalization of visual learning models, varying the overlap of HOI concepts between the training and test sets of few-shot instances from partial to no overlap. Bongard-HOI presents a substantial challenge to today's visual recognition models: the state-of-the-art HOI detection model achieves only 62% accuracy on few-shot binary prediction, while even amateur human testers on MTurk reach 91% accuracy. With the Bongard-HOI benchmark, we hope to further advance research efforts in visual reasoning, especially in holistic perception-reasoning systems and better representation learning.
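To make the task format concrete, the following is a minimal sketch of the few-shot binary prediction setup described above: each instance supplies positive and negative support images that disagree only on the HOI concept, and a model must classify held-out query images. All names and structures here are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical encoding of a Bongard-style few-shot instance (illustrative only).
@dataclass
class FewShotInstance:
    positives: List[str]  # support images depicting the HOI concept (e.g. "ride bicycle")
    negatives: List[str]  # hard negatives: same object category, different action
    queries: List[str]    # query images to classify
    labels: List[int]     # 1 if a query depicts the concept, else 0

def accuracy(instances: Sequence[FewShotInstance],
             predict: Callable[[List[str], List[str], str], int]) -> float:
    """Few-shot binary prediction accuracy over all queries in all instances."""
    correct = total = 0
    for inst in instances:
        for query, label in zip(inst.queries, inst.labels):
            # The predictor sees only this instance's supports plus one query,
            # so it must infer the concept in context, per instance.
            correct += int(predict(inst.positives, inst.negatives, query) == label)
            total += 1
    return correct / total
```

With balanced queries, a predictor that ignores the supports and always answers "positive" scores 50%, which is why context-dependent reasoning over the supports, rather than object recognition alone, is required to approach the reported human accuracy.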