This paper investigates the problem of zero-shot action recognition in the setting where no training videos with seen actions are available. For this challenging scenario, the current leading approach is to transfer knowledge from the image domain by recognizing objects in videos using pre-trained networks, followed by a semantic matching between objects and actions. Where objects provide a local view on the content of videos, in this work we also seek to include a global view of the scene in which actions occur. We find that scenes on their own are also capable of recognizing unseen actions, albeit less effectively than objects, and that a direct combination of object-based and scene-based scores degrades action recognition performance. To get the best out of objects and scenes, we propose to combine them as a Cartesian product of all possible object-scene compositions. We outline how to determine the likelihood of object-scene compositions in videos, as well as a semantic matching from object-scene compositions to actions that enforces diversity among the most relevant compositions for each action. While simple, our composition-based approach outperforms object-based approaches and even state-of-the-art zero-shot approaches that rely on large-scale video datasets with hundreds of seen actions for training and knowledge transfer.
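To make the composition idea concrete, below is a minimal sketch under stated assumptions: composition likelihoods are taken as the outer (Cartesian) product of per-video object and scene scores from pre-trained networks, and actions are matched via cosine similarity in a word-embedding space. All function names, the top-k selection, and the embedding setup are illustrative, not the authors' actual implementation; in particular, the paper additionally enforces diversity among the most relevant compositions per action, which is simplified here to a plain top-k.

```python
import numpy as np

def composition_likelihoods(object_probs: np.ndarray, scene_probs: np.ndarray) -> np.ndarray:
    """Likelihood of every object-scene composition in a video, sketched
    as the outer (Cartesian) product of the per-video object and scene
    likelihoods; a composition is likely only if both parts are."""
    return np.outer(object_probs, scene_probs)  # shape: (num_objects, num_scenes)

def action_scores(comp_likelihoods: np.ndarray,
                  comp_embeddings: np.ndarray,
                  action_embeddings: np.ndarray,
                  top_k: int = 5) -> np.ndarray:
    """Semantic matching from compositions to unseen actions: each action
    is scored by its top-k compositions, weighting a composition's video
    likelihood by its embedding similarity to the action. (The paper's
    diversity constraint on the selected compositions is omitted.)"""
    comp = comp_likelihoods.ravel()                       # (num_compositions,)
    # Cosine similarity between composition and action embeddings.
    c = comp_embeddings / np.linalg.norm(comp_embeddings, axis=1, keepdims=True)
    a = action_embeddings / np.linalg.norm(action_embeddings, axis=1, keepdims=True)
    sims = c @ a.T                                        # (num_compositions, num_actions)
    relevance = comp[:, None] * sims                      # likelihood-weighted similarity
    top = np.sort(relevance, axis=0)[-top_k:]             # top-k compositions per action
    return top.sum(axis=0)                                # one score per unseen action

# Toy usage with random stand-ins for the real model outputs; a composition
# embedding could e.g. average the word vectors of its object and scene.
rng = np.random.default_rng(0)
obj_p, scn_p = rng.dirichlet(np.ones(100)), rng.dirichlet(np.ones(50))
comp_emb = rng.normal(size=(100 * 50, 300))               # one vector per composition
act_emb = rng.normal(size=(10, 300))                      # 10 unseen action embeddings
scores = action_scores(composition_likelihoods(obj_p, scn_p), comp_emb, act_emb)
predicted_action = scores.argmax()
```

Note the design choice the outer product encodes: combining objects and scenes multiplicatively at the composition level, rather than adding their separate scores, avoids the degradation the abstract reports for direct score combination, since a composition only contributes when both its object and its scene are supported by the video.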