Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by compute-device memory constraints and the lack of large-scale temporal annotations. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised with class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. As a result, the video encoder never learns temporal boundaries or unseen classes, creating a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents capturing the relations between different action categories and the background context in a video clip, which limits generalization capacity. To address these limitations, we propose a novel post-pre-training approach that leverages language without freezing the video encoder. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips, and language in the form of captions. Our experiments show that the proposed approach improves the state of the art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.
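To make the masked contrastive objective concrete, below is a minimal sketch of a masked, InfoNCE-style loss between clip and caption embeddings. It is not the paper's exact formulation: the masking rule shown here (background clips serve only as negatives, while foreground clips contribute positive clip-caption pairs), the function name `masked_contrastive_loss`, and the argument `foreground_mask` are illustrative assumptions.

```python
# A hedged sketch of a masked contrastive loss, assuming L2-normalized
# clip and caption embeddings aligned index-wise. The masking scheme is a
# hypothetical illustration, not the paper's exact loss.
import torch
import torch.nn.functional as F


def masked_contrastive_loss(clip_emb, caption_emb, foreground_mask, temperature=0.07):
    """
    clip_emb:        (N, D) video clip embeddings (foreground and background clips)
    caption_emb:     (N, D) caption embeddings, one per clip
    foreground_mask: (N,) bool, True for clips that depict an annotated activity
    """
    # Pairwise similarity matrix between all clips and all captions.
    logits = clip_emb @ caption_emb.t() / temperature
    targets = torch.arange(clip_emb.size(0), device=clip_emb.device)

    # Symmetric clip-to-text and text-to-clip cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2v = F.cross_entropy(logits.t(), targets, reduction="none")
    per_pair = 0.5 * (loss_v2t + loss_t2v)

    # Only foreground clips contribute positive pairs; background clips
    # still act as negatives through the similarity matrix.
    fg = foreground_mask.float()
    return (per_pair * fg).sum() / fg.sum().clamp(min=1)
```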