Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
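To make the two-level objective concrete, below is a minimal PyTorch sketch of a hierarchical contrastive loss of the kind the abstract describes: a clip-level term aligning clip features with their timestamped narrations, plus a video-level term aligning a long-video representation with the summary text. The function names, the use of symmetric InfoNCE, and the weighting factor `w` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings.

    a, b: (batch, dim) tensors where a[i] and b[i] form a positive pair;
    all other pairings in the batch serve as negatives.
    """
    logits = a @ b.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_loss(clip_emb, narration_emb, video_emb, summary_emb, w=1.0):
    """Combine clip-level and video-level alignment terms (a sketch).

    clip_emb / narration_emb: per-clip visual and narration-text embeddings.
    video_emb / summary_emb:  long-video visual embeddings (e.g., pooled
    over clip embeddings) and summary-text embeddings.
    w: assumed relative weight of the video-level term.
    """
    l_clip = info_nce(F.normalize(clip_emb, dim=-1),
                      F.normalize(narration_emb, dim=-1))
    l_video = info_nce(F.normalize(video_emb, dim=-1),
                       F.normalize(summary_emb, dim=-1))
    return l_clip + w * l_video
```

In this sketch the video-level visual embedding could be obtained by pooling or attending over the clip embeddings of a long video, so the summary-text constraint propagates "why" context back into the clip features; the actual aggregation used by HierVL may differ.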