Zero-shot action recognition is the task of recognizing action classes without visual examples, using only a semantic embedding that relates unseen classes to seen ones. The problem can be seen as learning a function that generalizes well to instances of unseen classes without losing discrimination between classes. Neural networks can model the complex boundaries between visual classes, which explains their success as supervised models. However, in zero-shot learning, these highly specialized class boundaries may not transfer well from seen to unseen classes. In this paper we propose a centroid-based representation, which clusters visual and semantic representations, considers all training samples at once, and in this way generalizes well to instances from unseen classes. We optimize the clustering using Reinforcement Learning, which we show is critical for our approach to work. We call the proposed method CLASTER and observe that it consistently outperforms the state-of-the-art on all standard datasets, including UCF101, HMDB51 and Olympic Sports, both in the standard zero-shot evaluation and in generalized zero-shot learning. Further, we show that our model performs competitively in the image domain as well, outperforming the state-of-the-art in many settings.
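To make the idea of a centroid-based representation concrete, the following is a minimal sketch, not the CLASTER implementation. It assumes plain k-means in place of the Reinforcement Learning optimization described above, uses toy 2-D features, and all function names (`kmeans`, `centroid_representation`) are hypothetical. Each sample is represented by its own feature concatenated with its nearest centroid, so the representation depends on the whole training set rather than on one sample alone.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points, k, iters=20):
    """Plain k-means (an assumed stand-in for the RL-optimized clustering).

    Initial centroids are taken at an even stride through `points` so the
    example is deterministic; k-means++ would be the more usual choice.
    """
    stride = len(points) // k
    centroids = [list(points[i * stride]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return centroids


def centroid_representation(x, centroids):
    """Concatenate a feature vector with its nearest centroid."""
    nearest = min(centroids, key=lambda c: squared_distance(x, c))
    return list(x) + list(nearest)


# Toy "joint" features: two well-separated groups of training samples.
train = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids = kmeans(train, k=2)
augmented = centroid_representation([0.2, 0.1], centroids)
```

Because the centroids are computed from all training samples, a sample from an unseen class is still mapped onto structure learned from the seen classes, which is the intuition the abstract appeals to.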