Temporal action segmentation partitions a video along the time axis and predicts an action label for every frame. Fully supervising such a segmentation model requires dense frame-wise action annotations, which are expensive and tedious to collect. This work proposes the first Constituent Action Discovery (CAD) framework, which requires only the video-level label of the high-level complex activity as supervision for temporal action segmentation. The proposed approach automatically discovers constituent video actions through an activity-classification task. Specifically, we define a finite number of latent action prototypes to construct video-level dual representations, through which these prototypes are learned collectively during activity-classification training. This design enables our approach to discover actions that are potentially shared across multiple complex activities. Because action-level supervision is unavailable, we adopt the Hungarian matching algorithm to relate latent action prototypes to ground-truth semantic classes for evaluation. We show that, with the high-level supervision, Hungarian matching can be extended from the existing video and activity levels to the global level. Global-level matching allows actions to be shared across activities, which has not been considered in the literature before. Extensive experiments demonstrate that our discovered actions benefit both temporal action segmentation and activity recognition.
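To make the evaluation protocol concrete: Hungarian matching maps each discovered prototype to a ground-truth action class by maximizing frame-level overlap. Below is a minimal sketch using `scipy.optimize.linear_sum_assignment`; the function and variable names are illustrative and not from the paper, and the pooling of frames shown here corresponds to the global-level setting (restricting `pred`/`gt` to a single video or activity would recover the other two levels).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_prototypes_to_classes(pred, gt, n_protos, n_classes):
    """Map discovered prototype indices to ground-truth action classes.

    pred, gt: 1-D integer arrays of per-frame labels. Concatenating the
    frames of all videos from all activities gives global-level matching,
    which lets one prototype serve several activities.
    """
    # Co-occurrence matrix: overlap[i, j] counts the frames where
    # prototype i coincides with ground-truth class j.
    overlap = np.zeros((n_protos, n_classes), dtype=np.int64)
    np.add.at(overlap, (pred, gt), 1)

    # Hungarian matching maximizes total overlap (minimize the negation).
    rows, cols = linear_sum_assignment(-overlap)
    mapping = dict(zip(rows.tolist(), cols.tolist()))

    # Relabel predictions; prototypes left unmatched get -1.
    remapped = np.array([mapping.get(int(p), -1) for p in pred])
    accuracy = (remapped == gt).mean()
    return mapping, accuracy
```

Note that `linear_sum_assignment` handles a rectangular cost matrix, so the number of prototypes need not equal the number of ground-truth classes; at most `min(n_protos, n_classes)` one-to-one pairs are formed.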