Modern AI applications involving video, such as video-text alignment, video search, and video captioning, benefit from a fine-grained understanding of video semantics. Existing approaches to video understanding either are data-hungry and require low-level annotations, or rely on general embeddings that are uninterpretable and can miss important details. We propose LASER, a neuro-symbolic approach that learns semantic video representations by leveraging logic specifications capable of capturing rich spatial and temporal properties in video data. In particular, we formulate the problem as one of alignment between raw videos and specifications. The alignment process efficiently trains low-level perception models to extract a fine-grained video representation that conforms to the desired high-level specification. Our pipeline can be trained end-to-end and can incorporate contrastive and semantic loss functions derived from the specifications. We evaluate our method on two datasets with rich spatial and temporal specifications: 20BN-Something-Something and MUGEN. We demonstrate that our method not only learns fine-grained video semantics but also outperforms existing baselines on downstream tasks such as video retrieval.