Weakly-supervised temporal action localization aims to localize action instances in untrimmed videos using only video-level supervision. We observe that different actions can share common phases, e.g., the run-up in HighJump and LongJump. We define such actions as conjoint actions; their remaining parts are definite phases, e.g., leaping over the bar in HighJump. Existing methods localize the definite phases more easily than the common phases. Most of them formulate the task as a Multiple Instance Learning (MIL) problem, in which the common phases tend to be confused with the background, which harms the localization completeness of conjoint actions. To tackle this challenge, we propose a Joint of Common and Definite phases Network (JCDNet) that improves the feature discriminability of conjoint actions. Specifically, we design a Class-Aware Discriminative module that enhances the contribution of the common phases to classification under the guidance of coarse definite-phase features. In addition, we introduce a temporal attention module that learns robust action-ness scores by modeling temporal dependencies, distinguishing the common phases from the background. Extensive experiments on three datasets (THUMOS14, ActivityNetv1.2, and a conjoint-action subset) demonstrate that JCDNet achieves competitive performance against state-of-the-art methods.

Keywords: weakly-supervised learning, temporal action localization, conjoint action
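To make the MIL formulation mentioned in the abstract concrete, the following is a minimal sketch of top-k temporal pooling, one common way video-level scores are obtained from snippet-level class scores in weakly-supervised localization. The abstract does not specify which pooling JCDNet uses, so the function names, the choice of top-k mean pooling, and the toy scores below are all illustrative assumptions, not the authors' implementation. The toy example hints at why common phases can be neglected: only the highest-scoring snippets (here, the definite phase) contribute to the video-level prediction.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_mil_video_scores(cas, k):
    """Hypothetical helper: pool a class activation sequence into video-level scores.

    cas: list of T snippets, each a list of C class logits.
    For each class, average its k largest snippet logits, then apply softmax.
    """
    num_classes = len(cas[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((snippet[c] for snippet in cas), reverse=True)
        pooled.append(sum(column[:k]) / k)
    return softmax(pooled)


def topk_indices(cas, cls, k):
    """Return the (sorted) snippet indices that top-k pooling selects for one class."""
    order = sorted(range(len(cas)), key=lambda t: cas[t][cls], reverse=True)
    return sorted(order[:k])


# Toy scores (assumed for illustration): class 0 = HighJump, class 1 = LongJump.
toy_cas = [
    [0.1, 0.1],  # background
    [0.8, 0.7],  # run-up (common phase, ambiguous between classes)
    [0.9, 0.8],  # run-up (common phase)
    [3.0, 0.2],  # leaping over the bar (definite phase)
    [2.8, 0.1],  # leaping over the bar (definite phase)
    [0.0, 0.0],  # background
]

video_probs = topk_mil_video_scores(toy_cas, k=2)
selected = topk_indices(toy_cas, cls=0, k=2)
print(video_probs, selected)
```

In this toy case the video-level loss would be driven entirely by the two definite-phase snippets, so the run-up snippets receive no direct training signal that separates them from the background.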