Action segmentation is the task of predicting an action label for each frame of a video. Because fully annotating videos for action segmentation is expensive, weakly supervised approaches that learn from transcripts alone are appealing. In this paper, we propose a novel end-to-end approach for weakly supervised action segmentation based on a two-branch neural network. The two branches predict two redundant but different representations of the action segmentation, and we propose a novel mutual consistency (MuCon) loss that enforces consistency between them. Using the MuCon loss together with a loss for transcript prediction, our approach matches the accuracy of state-of-the-art approaches while being $14$ times faster to train and $20$ times faster at inference. The MuCon loss also proves beneficial in the fully supervised setting.
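To make the idea of two redundant representations concrete, the following is a minimal sketch and not the MuCon loss itself. It assumes, purely for illustration, that one representation is a list of per-frame labels and the other is a list of (action, length) segments. The helper names (`frames_to_segments`, `segments_to_frames`, `frame_disagreement`) and the toy labels are hypothetical, and the disagreement score is a simple non-differentiable stand-in for a consistency measure.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (action, length) segments."""
    return [(label, len(list(run))) for label, run in groupby(frame_labels)]


def segments_to_frames(segments):
    """Expand (action, length) segments back into per-frame labels."""
    return [label for label, length in segments for _ in range(length)]


def frame_disagreement(frames_a, frames_b):
    """Fraction of frames on which two equally long labelings differ."""
    if len(frames_a) != len(frames_b):
        raise ValueError("labelings must cover the same number of frames")
    if not frames_a:
        return 0.0
    return sum(a != b for a, b in zip(frames_a, frames_b)) / len(frames_a)


# Hypothetical outputs of the two branches for a 6-frame video.
frame_branch = ["pour", "pour", "pour", "stir", "stir", "drink"]
segment_branch = [("pour", 2), ("stir", 3), ("drink", 1)]

# The transcript is the ordered list of actions without any timing.
transcript = [label for label, _ in frames_to_segments(frame_branch)]

# Compare the two representations on a common per-frame footing.
disagreement = frame_disagreement(frame_branch, segments_to_frames(segment_branch))
```

In this toy case both branches agree on the transcript but place the boundary between the first two actions one frame apart, so the disagreement is nonzero. A trainable consistency loss would need a differentiable counterpart of this comparison.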