Action segmentation is the task of temporally segmenting an untrimmed video by assigning an action label to every frame. Weakly supervised approaches to action segmentation, especially those that learn from transcripts, have been of considerable interest to the computer vision community. In this work, we focus on two aspects of the use and evaluation of weakly supervised action segmentation approaches that are often overlooked: the variance in performance over multiple training runs and the impact of the choice of feature extractor. To address the first, we train each method on the Breakfast dataset 5 times and report the average and standard deviation of the results. Our experiments show that the standard deviation over these repetitions is between 1% and 2.5% and significantly affects the comparison between different approaches. Furthermore, our investigation of feature extraction shows that, for the studied weakly supervised action segmentation methods, higher-level I3D features perform worse than classical IDT features.
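The reporting protocol described above (several training runs per method, summarized by average and standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `summarize_runs` and `gap_exceeds_spread` are hypothetical, the scores are placeholder numbers rather than results from the paper, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the abstract does not say which estimator is used.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    # statistics.stdev divides by n - 1 (sample estimate); this is an assumption.
    return statistics.mean(scores), statistics.stdev(scores)


def gap_exceeds_spread(scores_a, scores_b):
    """Rough check: is the gap between the means larger than the sum of both
    standard deviations? An illustrative heuristic, not a significance test
    and not a criterion taken from the paper."""
    mean_a, std_a = summarize_runs(scores_a)
    mean_b, std_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) > std_a + std_b


# Placeholder frame accuracies (%) for 5 runs each; NOT results from the paper.
method_a = [60.0, 62.0, 61.0, 59.0, 63.0]
method_b = [61.0, 63.0, 60.0, 62.0, 64.0]

mean_a, std_a = summarize_runs(method_a)
mean_b, std_b = summarize_runs(method_b)
print(f"method A: {mean_a:.1f} +/- {std_a:.2f}")
print(f"method B: {mean_b:.1f} +/- {std_b:.2f}")
print("gap exceeds run-to-run spread:", gap_exceeds_spread(method_a, method_b))
```

With these placeholder scores the two means differ by 1 point while each method's run-to-run standard deviation is about 1.6 points, so a single-run comparison could rank the methods either way. This is the kind of situation in which reporting only one run per method becomes misleading.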