Temporal action segmentation is crucial for understanding long-form videos. Previous works on this task commonly adopt an iterative refinement paradigm using multi-stage models. This paper proposes an essentially different framework based on denoising diffusion models, which nonetheless shares the inherent spirit of iterative refinement. In this framework, action predictions are iteratively generated from random noise, conditioned on input video features. To enhance the modeling of three salient characteristics of human actions, namely the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs in our framework. Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and Breakfast, show that the proposed method achieves results superior or comparable to state-of-the-art methods, demonstrating the effectiveness of a generative approach for action segmentation. Our code will be made available.
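To make the generative formulation concrete, below is a minimal sketch of the reverse (denoising) process for action segmentation: starting from Gaussian noise over per-frame action logits and iteratively denoising it, conditioned on the video features. This is an illustrative assumption, not the paper's actual implementation; the names `sample_segmentation` and `model`, the cosine schedule, the number of steps `T`, and the DDIM-style deterministic update are all stand-ins, and the conditioning masks mentioned above would be applied to `video_feats` during training rather than shown here.

```python
# Hedged sketch of diffusion-based action segmentation inference.
# `model(x_t, t, cond)` is a hypothetical denoising network that predicts
# the clean per-frame action sequence from a noisy one.
import torch

def cosine_betas(T: int) -> torch.Tensor:
    # Standard cosine noise schedule (Nichol & Dhariwal, 2021).
    steps = torch.linspace(0, T, T + 1) / T
    alphas_bar = torch.cos((steps + 0.008) / 1.008 * torch.pi / 2) ** 2
    betas = 1 - alphas_bar[1:] / alphas_bar[:-1]
    return betas.clamp(max=0.999)

@torch.no_grad()
def sample_segmentation(model, video_feats, num_classes, T=25):
    """Progressively denoise random noise into per-frame action logits,
    conditioned on (optionally masked) video features."""
    num_frames = video_feats.shape[0]
    betas = cosine_betas(T)
    alphas_bar = torch.cumprod(1 - betas, dim=0)

    x = torch.randn(num_frames, num_classes)   # start from pure noise
    for t in reversed(range(T)):
        x0_pred = model(x, t, video_feats)     # predict the clean sequence
        if t > 0:
            # Recover the implied noise, then take a DDIM-style
            # deterministic step toward x0_pred.
            eps = (x - alphas_bar[t].sqrt() * x0_pred) / (1 - alphas_bar[t]).sqrt()
            x = alphas_bar[t - 1].sqrt() * x0_pred + (1 - alphas_bar[t - 1]).sqrt() * eps
        else:
            x = x0_pred
    return x.argmax(dim=-1)                    # per-frame action labels
```

The iterative loop is what connects this formulation to the multi-stage refinement paradigm: each denoising step plays the role that one refinement stage plays in prior models, but the refinement trajectory is learned generatively.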