Action understanding has evolved into an era of fine granularity, as many human behaviors in real life differ only in minor ways. To detect these fine-grained actions accurately in a label-efficient manner, we tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time. Without careful design to capture the subtle differences between fine-grained actions, previous weakly-supervised models for general action detection do not perform well in the fine-grained setting. We propose to model actions as combinations of reusable atomic actions, automatically discovered from data through self-supervised clustering, so as to capture the commonality and individuality of fine-grained actions. The learnt atomic actions, represented by visual concepts, are further mapped to fine and coarse action labels by leveraging the semantic label hierarchy. Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level, and coarse action class level, with supervision at each level. Extensive experiments on two large-scale fine-grained video datasets, FineAction and FineGym, show the benefit of our proposed weakly-supervised model for fine-grained action detection, and our model achieves state-of-the-art results.
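The four-level hierarchy described above (clips → atomic actions → fine classes → coarse classes) can be illustrated with a minimal, hypothetical sketch: clip-level features are clustered into "atomic actions" with a toy k-means, then atomic actions are mapped to fine labels, and fine labels to coarse labels through a hand-written label hierarchy. All names, features, and the clustering routine here are illustrative assumptions, not the paper's actual implementation (which uses self-supervised clustering of learned video representations).

```python
# Hypothetical sketch of the four-level hierarchy; NOT the paper's method.
# Clip features, label names, and the toy k-means below are all made up
# for illustration.

def kmeans(points, k, iters=20):
    # Deterministic init: first and last point as initial centers.
    centers = [points[0], points[-1]][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j in range(k):
            if clusters[j]:
                dim = len(points[0])
                centers[j] = tuple(sum(p[d] for p in clusters[j]) / len(clusters[j])
                                   for d in range(dim))
    return centers

def assign(p, centers):
    # Nearest-center assignment = the clip's discovered "atomic action" id.
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

# Level 1: toy clip-level features (two well-separated groups).
clips = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
         (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]

# Level 2: atomic actions discovered by clustering.
centers = kmeans(clips, k=2)
atomic_ids = [assign(p, centers) for p in clips]

# Levels 3-4: hypothetical semantic label hierarchy mapping atomic actions
# to fine action classes and fine classes to coarse classes.
fine_of_atomic = {0: "balance_beam_leap", 1: "balance_beam_turn"}
coarse_of_fine = {"balance_beam_leap": "balance_beam",
                  "balance_beam_turn": "balance_beam"}

fine_labels = [fine_of_atomic[a] for a in atomic_ids]
coarse_labels = [coarse_of_fine[f] for f in fine_labels]
```

In this toy setup the two clip groups fall into distinct atomic actions and hence distinct fine classes, while both roll up to the same coarse class, mirroring how supervision can be applied at each level of the hierarchy.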