A significant gap remains between today's visual pattern recognition models and human-level visual cognition, especially when it comes to few-shot learning and compositional reasoning about novel concepts. We introduce Bongard-HOI, a new visual reasoning benchmark that focuses on compositional learning of human-object interactions (HOIs) from natural images. It is inspired by two desirable characteristics of the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning. We carefully curate the few-shot instances with hard negatives, where positive and negative images disagree only on action labels, so that mere recognition of object categories is insufficient to solve our benchmark. We also design multiple test sets to systematically study the generalization of visual learning models, varying the overlap of HOI concepts between the training and test sets of few-shot instances from partial to no overlap. Bongard-HOI presents a substantial challenge to today's visual recognition models: the state-of-the-art HOI detection model achieves only 62% accuracy on few-shot binary prediction, whereas even amateur human testers on MTurk reach 91% accuracy. With the Bongard-HOI benchmark, we hope to further advance research efforts in visual reasoning, especially in holistic perception-reasoning systems and better representation learning.
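To make the few-shot binary prediction task concrete, below is a minimal Python sketch of how one Bongard-HOI instance and its evaluation could be represented: a set of positive support images that depict a latent HOI concept, a set of hard-negative support images that disagree only on the action label, and a query image to be classified. The class names, field names, support-set sizes, and the `model.predict` interface are illustrative assumptions, not the benchmark's actual data format or API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BongardHOIInstance:
    """One few-shot instance: support images that do / do not depict the
    latent HOI concept (e.g. "ride bicycle"), plus a held-out query image.
    Field names are illustrative, not the benchmark's actual format."""
    positive_support: List[str]   # image paths depicting the HOI concept
    negative_support: List[str]   # hard negatives: same objects, different action
    query_image: str              # image to classify
    query_label: int              # 1 if the query depicts the concept, else 0

def evaluate(model, instances: List[BongardHOIInstance]) -> float:
    """Few-shot binary prediction accuracy, as reported in the abstract.
    `model.predict` is a hypothetical interface: it receives both support
    sets and the query image, and returns a binary decision."""
    correct = 0
    for inst in instances:
        pred = model.predict(inst.positive_support,
                             inst.negative_support,
                             inst.query_image)
        correct += int(pred == inst.query_label)
    return correct / len(instances)
```

Under this reading, the reported numbers (62% for the best HOI detection model vs. 91% for MTurk annotators) are accuracies of such binary decisions aggregated over many few-shot instances.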