Supervised approaches for learning spatio-temporal scene graphs (STSGs) from video are hindered by their reliance on STSG-annotated videos, which are labor-intensive to construct at scale. Is it feasible to instead use readily available video captions as weak supervision? To address this question, we propose LASER, a neuro-symbolic framework for training STSG generators using only video captions. LASER employs large language models to first extract logical specifications with rich spatio-temporal semantic information from video captions. LASER then trains the underlying STSG generator to align the predicted STSG with the specification. The alignment algorithm overcomes the challenges of weak supervision by leveraging a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantic losses. The overall approach efficiently trains low-level perception models to extract a fine-grained STSG that conforms to the video caption, enabling a novel methodology for learning STSGs without tedious annotations. We evaluate our method on three video datasets: OpenPVSG, 20BN, and MUGEN. Our approach demonstrates substantial improvements over fully supervised baselines, achieving a unary predicate prediction accuracy of 27.78% (+12.65%) and a binary recall@5 of 0.42 (+0.22) on OpenPVSG. Additionally, LASER exceeds baselines by 7% on 20BN and 5.2% on MUGEN in terms of overall predicate prediction accuracy.
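To make the alignment objective more concrete, the sketch below illustrates how a contrastive, a temporal, and a semantic term could be combined into one weakly supervised training loss. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the function name alignment_training_step, the stsg_generator and reasoner callables, the specific forms of the temporal and semantic terms, and the toy stand-ins in the usage example are all hypothetical placeholders; the real specification semantics come from LASER's differentiable symbolic reasoner.

```python
import torch
import torch.nn.functional as F


def alignment_training_step(stsg_generator, reasoner, video, pos_spec, neg_spec,
                            w_temporal=0.5, w_semantic=0.5):
    """One illustrative weakly supervised step: score the predicted STSG against a
    caption-derived specification and combine three loss terms.
    All names and term definitions here are hypothetical, not LASER's actual code."""
    # Predicted STSG as per-frame predicate probabilities (assumed format).
    stsg_probs = stsg_generator(video)

    # A differentiable reasoner is assumed to return the probability that the
    # predicted STSG satisfies a logical specification.
    p_pos = reasoner(stsg_probs, pos_spec)   # spec extracted from the matching caption
    p_neg = reasoner(stsg_probs, neg_spec)   # spec from a mismatched caption

    # Contrastive term: the matching spec should score higher than a mismatched one.
    contrastive = F.relu(1.0 - p_pos + p_neg)

    # Temporal term (placeholder): push the satisfaction probability of the
    # temporally ordered specification toward 1.
    temporal = -torch.log(p_pos + 1e-8)

    # Semantic term (placeholder): entropy regularization over predicate distributions.
    semantic = -(stsg_probs * torch.log(stsg_probs + 1e-8)).sum(dim=-1).mean()

    return contrastive + w_temporal * temporal + w_semantic * semantic


if __name__ == "__main__":
    # Toy usage with stand-in callables (not the actual LASER components):
    # 8 frames, 10 candidate predicates per frame.
    fake_generator = lambda v: torch.softmax(torch.randn(8, 10), dim=-1)
    # Dummy satisfaction score in (0, 1), ignoring the specification argument.
    fake_reasoner = lambda probs, spec: probs.max(dim=-1).values.mean()
    loss = alignment_training_step(fake_generator, fake_reasoner,
                                   video=None, pos_spec="spec+", neg_spec="spec-")
    print(loss.item())
```

The design choice being illustrated is that all three terms stay differentiable end to end, so gradients flow from the caption-derived specification back into the low-level perception model.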