From a visual scene containing multiple people, humans are able to distinguish each individual given context descriptions about what happened before, their mental/physical states, or their intentions. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the "person who needs healing" in an image, we first need to know that such a person usually has injuries or a suffering expression, then find the corresponding visual clues before finally grounding the person. We present a new commonsense task, Human-centric Commonsense Grounding, which tests a model's ability to ground individuals given context descriptions about what happened before and their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types of commonsense and visual scenes. We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pre-trained models. Further analysis demonstrates that rich visual commonsense and powerful integration of multi-modal commonsense are essential, which sheds light on future work. Data and code will be available at https://github.com/Hxyou/HumanCog.