This work focuses on training a generative action/video recognition model whose output is a free-form, action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages, such as producing more fine-grained and human-readable output, and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While there have recently been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We first show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and Self-training, i.e. without using any action-specific labels; (b) a Retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition, where we show that our approach is highly competitive compared to contrastive learning-based methods. Code will be made available.
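To make the Retrieval component more concrete, below is a minimal illustrative sketch (not the authors' implementation) of how CLIP could be used to retrieve pseudo-captions for a video: frame embeddings are mean-pooled into a clip-level embedding and matched against a pool of candidate captions. The function name `retrieve_pseudo_captions`, the `caption_corpus` pool, the frame-averaging step, and the `top_k` cut-off are all assumptions for illustration; REST's actual retrieval and diversity mechanism may differ.

```python
# Illustrative sketch only: CLIP-based retrieval of pseudo-captions for a video.
# Assumes a hypothetical pool of candidate captions ("caption_corpus") and that
# sampled frames are mean-pooled into a single video embedding.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def retrieve_pseudo_captions(frame_paths, caption_corpus, top_k=5):
    # Encode sampled video frames and mean-pool them into a clip-level embedding.
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        frame_emb = model.encode_image(frames)
        video_emb = frame_emb.mean(dim=0, keepdim=True)
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

        # Encode the candidate captions.
        tokens = clip.tokenize(caption_corpus, truncate=True).to(device)
        text_emb = model.encode_text(tokens)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Rank captions by cosine similarity and keep the top-k as pseudo-captions.
    sims = (video_emb @ text_emb.T).squeeze(0)
    top = sims.topk(min(top_k, len(caption_corpus))).indices.tolist()
    return [caption_corpus[i] for i in top]
```

Retrieved captions such as these could then serve as targets for the pseudo-caption generation and Self-training stage described above.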