Since collecting and annotating data for spatio-temporal action detection is very expensive, there is a need for approaches that learn with less supervision. Weakly supervised approaches do not require any bounding box annotations and can be trained only from labels that indicate whether an action occurs in a video clip. Current approaches, however, cannot handle the case where multiple persons in a video perform multiple actions at the same time. In this work, we address this very challenging task for the first time. We propose a baseline based on multi-instance and multi-label learning (MIML). Furthermore, we propose a novel approach that uses sets of actions as the representation instead of modeling individual action classes. Since computing the probabilities for the full power set becomes intractable as the number of action classes increases, we assign an action set to each detected person under the constraint that the assignment is consistent with the annotation of the video clip. We evaluate the proposed approach on the challenging AVA dataset, where it outperforms the MIML baseline and is competitive with fully supervised approaches.
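The constrained assignment of action sets can be illustrated with a small sketch. This is not the method of the paper; it is a toy example under stated assumptions. It assumes that each detected person comes with independent per-class probabilities, that an assignment is "consistent" when the union of the assigned sets equals the clip-level annotation, and that the best assignment maximizes a summed log-likelihood. The names `nonempty_subsets`, `set_log_score`, and `assign_action_sets` are hypothetical, and the exhaustive search is only feasible for a few persons and labels.

```python
from itertools import combinations, product
from math import log


def nonempty_subsets(labels):
    """All non-empty subsets of the clip-level labels (hypothetical helper)."""
    items = sorted(labels)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def set_log_score(probs, action_set, clip_labels):
    """Log-likelihood of one action set, assuming independent classes.

    probs maps each clip label to a probability strictly between 0 and 1.
    """
    return sum(log(probs[a]) if a in action_set else log(1.0 - probs[a])
               for a in clip_labels)


def assign_action_sets(person_probs, clip_labels):
    """Exhaustively pick one action set per person.

    Assumed consistency constraint: the union of all assigned sets
    must equal the clip annotation. Returns None if no person is given
    and the annotation is non-empty.
    """
    target = frozenset(clip_labels)
    candidates = nonempty_subsets(target)
    best, best_score = None, float("-inf")
    for assignment in product(candidates, repeat=len(person_probs)):
        if frozenset().union(*assignment) != target:
            continue  # violates the clip-level annotation
        score = sum(set_log_score(p, s, target)
                    for p, s in zip(person_probs, assignment))
        if score > best_score:
            best, best_score = list(assignment), score
    return best


# Toy clip annotated with three actions and two detected persons.
clip_labels = {"stand", "talk", "listen"}
person_probs = [
    {"stand": 0.9, "talk": 0.8, "listen": 0.1},
    {"stand": 0.7, "talk": 0.2, "listen": 0.4},
]
result = assign_action_sets(person_probs, clip_labels)
```

In this toy clip, the second person would receive only "stand" without the constraint, since "listen" has probability 0.4. Because "listen" appears in the clip annotation, some person must carry it, and assigning it to the second person costs the least likelihood. The candidate sets are restricted to subsets of the clip annotation, which avoids enumerating the power set over all action classes.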