When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who, and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g., actions, objects, scenes, and sentiment) while allowing us to gain new insight into what people find important or necessary to summarize specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain. To address this, we present the Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events. We collect our descriptions using audio recordings to ensure that they remain as natural and concise as possible while still scaling to the size of a large classification dataset. To utilize our proposed dataset, we present a novel Adaptive Mean Margin (AMM) approach to contrastive learning and evaluate our models on video/caption retrieval on multiple datasets. We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
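The abstract names the Adaptive Mean Margin (AMM) approach without spelling out its formulation. As a rough illustration of the general idea of an adaptively chosen margin for cross-modal contrastive retrieval, the PyTorch sketch below sets the hinge margin from the batch-mean similarity of the negative pairs, so that harder batches demand a larger positive/negative gap. This is a minimal sketch under stated assumptions, not the paper's definition: the function name, the `base_margin` hyperparameter, and this exact margin rule are all hypothetical.

```python
import torch
import torch.nn.functional as F

def adaptive_mean_margin_loss(video_emb, caption_emb, base_margin=0.2):
    """Hinge-based contrastive retrieval loss with a batch-adaptive margin.

    One plausible reading of an 'adaptive mean margin': shift a base margin
    by the mean similarity of the in-batch negative pairs. Hyperparameters
    and the exact rule are illustrative, not the paper's formulation.
    """
    # Cosine similarity between every video/caption pair in the batch.
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    sim = v @ c.t()                      # (B, B); diagonal holds positives

    B = sim.size(0)
    pos = sim.diag()                     # similarity of matched pairs
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=sim.device)

    # Adaptive margin: detach so the margin acts as a per-batch constant
    # rather than contributing gradients of its own.
    margin = base_margin + sim[neg_mask].mean().detach()

    # Hinge over every negative in both retrieval directions.
    cost_c = (margin + sim - pos.unsqueeze(1)).clamp(min=0)  # caption retrieval
    cost_v = (margin + sim - pos.unsqueeze(0)).clamp(min=0)  # video retrieval
    cost_c = cost_c.masked_fill(~neg_mask, 0)
    cost_v = cost_v.masked_fill(~neg_mask, 0)
    return (cost_c.sum() + cost_v.sum()) / (B * (B - 1))
```

A fixed-margin triplet loss is the special case where the mean-similarity term is dropped; letting the margin track the batch statistics is one way to keep the constraint meaningful as the embeddings tighten during training.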