The main challenge of Temporal Action Localization is to retrieve subtle human actions from the various co-occurring ingredients, e.g., context and background, in an untrimmed video. While prior approaches have achieved substantial progress by devising advanced action detectors, they still suffer from these co-occurring ingredients, which often dominate the actual action content in videos. In this paper, we explore two orthogonal but complementary aspects of a video snippet, i.e., the action features and the co-occurrence features. In particular, we develop a novel auxiliary task that decouples these two types of features within a video snippet and recombines them to generate a new feature representation with more salient action information for accurate action localization. We term our method RefactorNet: it first explicitly factorizes the action content and regularizes its co-occurrence features, and then synthesizes a new, action-dominated video representation. Extensive experiments and ablation studies on THUMOS14 and ActivityNet v1.3 demonstrate that our new representation, combined with a simple action detector, significantly improves action localization performance.
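The abstract does not spell out the architecture, so the following is only a minimal PyTorch sketch of the decouple-regularize-recombine idea it describes; the module name `RefactorSketch`, the two encoders, and the `cooc_weight` down-weighting are hypothetical illustrations, not the paper's actual design.

```python
import torch
import torch.nn as nn

class RefactorSketch(nn.Module):
    """Hypothetical sketch: factorize a snippet feature into an action
    component and a co-occurrence component, suppress the latter, and
    resynthesize an action-dominated snippet representation."""

    def __init__(self, dim: int = 2048):
        super().__init__()
        # Two encoders factorize the snippet feature into the two
        # orthogonal but complementary aspects named in the abstract.
        self.action_enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.cooc_enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        # A decoder recombines the two components into a new feature
        # representation for the downstream action detector.
        self.decoder = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, cooc_weight: float = 0.1):
        a = self.action_enc(x)   # action features
        c = self.cooc_enc(x)     # co-occurrence (context/background) features
        # Down-weight the co-occurrence component before recombination
        # (a stand-in for the paper's regularization) so that action
        # information dominates the synthesized representation.
        z = self.decoder(torch.cat([a, cooc_weight * c], dim=-1))
        return z, a, c

# Usage: refactor snippet features, then feed `z` to any action detector.
model = RefactorSketch(dim=2048)
snippets = torch.randn(16, 100, 2048)  # batch x snippets x feature dim
z, a, c = model(snippets)
```

In this reading, the refactoring acts as a feature preprocessor: the detector itself is unchanged, which matches the abstract's claim that the new representation improves even a simple action detector.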