Video action segmentation aims to partition a video into action segments. Recently, timestamp supervision has received much attention due to its lower annotation cost. We observe that frames near the boundaries of action segments lie in the transition region between two consecutive actions and have unclear semantics; we call these regions ambiguous intervals. Most existing methods iteratively generate pseudo-labels for all frames in each video to train the segmentation model. However, frames in ambiguous intervals are more likely to be assigned noisy, incorrect pseudo-labels, which degrades performance. We propose a novel framework for training the model under timestamp supervision, consisting of two parts. First, pseudo-label ensembling generates pseudo-label sequences that contain ambiguous intervals, whose frames are left without pseudo-labels. Second, iterative clustering repeatedly propagates pseudo-labels into the ambiguous intervals by clustering, thereby updating the pseudo-label sequences used to train the model. We further introduce a clustering loss that encourages the features of frames within the same action segment to be more compact. Extensive experiments demonstrate the effectiveness of our method.
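To make the two ideas concrete, below is a minimal sketch, not the authors' released code: the tensor shapes, the nearest-centroid assignment rule for propagating labels into an ambiguous interval, and the function names `propagate_into_interval` and `clustering_loss` are illustrative assumptions.

```python
# Illustrative sketch only; details of the actual method may differ.
import torch
import torch.nn.functional as F

def propagate_into_interval(feats, left_label, right_label,
                            left_centroid, right_centroid):
    """Assign each frame of an ambiguous interval to the nearer of the two
    neighbouring segments' feature centroids (a simple two-way clustering step).

    feats:          (T, D) frame features inside the ambiguous interval
    left_centroid:  (D,) mean feature of the preceding action segment
    right_centroid: (D,) mean feature of the following action segment
    """
    d_left = torch.cdist(feats, left_centroid.unsqueeze(0)).squeeze(1)    # (T,)
    d_right = torch.cdist(feats, right_centroid.unsqueeze(0)).squeeze(1)  # (T,)
    return torch.where(d_left <= d_right,
                       torch.full_like(d_left, left_label, dtype=torch.long),
                       torch.full_like(d_right, right_label, dtype=torch.long))

def clustering_loss(feats, labels):
    """Pull each frame's feature toward the centroid of its (pseudo-)labelled
    segment, so features within the same action segment become more compact."""
    loss = feats.new_zeros(())
    for c in labels.unique():
        seg = feats[labels == c]                    # frames of one segment
        centroid = seg.mean(dim=0, keepdim=True)
        loss = loss + F.mse_loss(seg, centroid.expand_as(seg))
    return loss / labels.unique().numel()
```

In this reading, each training iteration would re-estimate segment centroids from the current model features, fill in the ambiguous intervals with `propagate_into_interval`, and then retrain with the updated pseudo-labels plus the compactness term.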