We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents must reason about and forecast the future actions of humans, given video clips of their initial activities and their intents expressed in text. The dataset consists of 2.9k videos from \charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multimodal knowledge base (MKB), while the others are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and the effective use of information from both modalities.
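To make the task format concrete, the following is a minimal sketch of how a single multi-choice test item pairing a video clip and a textual intent with candidate future action sequences might be represented; this is not the released data loader, and all class and field names here are hypothetical.
\begin{verbatim}
from dataclasses import dataclass
from typing import List

@dataclass
class ViLPActInstance:
    """Hypothetical schema for one multi-choice test item (illustrative only)."""
    video_path: str           # clip showing the person's initial activities
    intent: str               # crowdsourced textual description of the intent
    choices: List[List[str]]  # candidate future action sequences
    answer: int               # index of the gold continuation

def is_correct(instance: ViLPActInstance, predicted: int) -> bool:
    # Multi-choice accuracy: compare the predicted index to the gold index.
    return predicted == instance.answer
\end{verbatim}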