ActionCLIP: A New Paradigm for Video Action Recognition

The canonical approach to video action recognition requires a neural model to perform a classic and standard 1-of-N majority vote task. Such models are trained to predict a fixed set of predefined categories, which limits their ability to transfer to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the deficiency of label texts and make use of the tremendous amount of web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task act more like the pre-training problem via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git
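The video-text matching idea in the abstract can be illustrated with a minimal sketch. It is not the ActionCLIP implementation: the embeddings are hand-written toy vectors standing in for encoder outputs, and the prompt template, the mean pooling over frames, the temperature value, and every function name are assumptions made for illustration only.

```python
import math

def make_prompt(label):
    # Hypothetical prompt template; the paper's actual prompts may differ.
    return f"a video of a person {label}"

def mean_pool(frames):
    # Average per-frame embeddings into one video embedding (assumed aggregation).
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def match_video_to_labels(frame_embeddings, label_embeddings, temperature=0.07):
    # Score the video against every label text and normalise the scores.
    # temperature=0.07 is an assumed value, not taken from the paper.
    video = mean_pool(frame_embeddings)
    names = list(label_embeddings)
    sims = [cosine(video, label_embeddings[name]) / temperature for name in names]
    return dict(zip(names, softmax(sims)))

# Toy stand-ins for encoder outputs: three frames, three candidate label texts.
frame_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
label_embeddings = {
    make_prompt("playing guitar"): [1.0, 0.1, 0.0],
    make_prompt("swimming"): [0.0, 1.0, 0.2],
    make_prompt("riding a bike"): [0.1, 0.0, 1.0],
}

probs = match_video_to_labels(frame_embeddings, label_embeddings)
best = max(probs, key=probs.get)
print(best)  # a video of a person playing guitar
```

Because the classes enter only as text embeddings, adding an unseen class in this sketch means adding one more entry to `label_embeddings`; no classifier weights are tied to a fixed label set, which is the property the abstract relies on for zero-shot recognition.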

Related content

Multimodal learning

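Information in the real world usually appears in different modalities. For example, images are often accompanied by tags and text explanations, and texts include images to express the main idea of an article more clearly. Different modalities are characterized by very different statistical properties: images are usually represented as pixel intensities or as the outputs of feature extractors, while texts are represented as discrete word vectors. Because these statistical properties differ, discovering the relationships between modalities is very important. Multimodal learning models provide a joint representation of different modalities, and they can also fill in a missing modality given the observed ones. In one such model, the multimodal deep Boltzmann machine, two deep Boltzmann machines are combined, one per modality, and an additional hidden layer is placed on top of them to give the joint representation.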
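The joint hidden layer described above can be sketched as follows. This is only an illustration under stated assumptions: it computes the activation probability of each joint unit given the top hidden states of an image pathway and a text pathway, with hand-picked toy weights. The function and variable names are hypothetical, and training and sampling of the full Boltzmann machine are omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def joint_layer_probs(h_image, h_text, w_image, w_text, bias):
    # Probability that each joint unit is on, given both pathways' hidden states.
    # w_image[i][j] links image unit i to joint unit j; w_text likewise for text.
    probs = []
    for j in range(len(bias)):
        activation = bias[j]
        activation += sum(h * w_image[i][j] for i, h in enumerate(h_image))
        activation += sum(h * w_text[i][j] for i, h in enumerate(h_text))
        probs.append(sigmoid(activation))
    return probs

# Toy example: 2 image hidden units, 3 text hidden units, 2 joint units.
h_image = [1, 0]
h_text = [0, 1, 1]
w_image = [[0.5, -0.2], [0.1, 0.3]]
w_text = [[0.2, 0.2], [-0.4, 0.6], [0.3, 0.1]]
bias = [0.0, 0.1]

joint = joint_layer_probs(h_image, h_text, w_image, w_text, bias)
print(joint)
```

Each joint unit receives input from both pathways at once, so its state depends on the image and the text together; that shared dependence is what makes the top layer a joint representation.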
