Recent temporal action segmentation approaches have been highly effective. However, most of them require frame-level annotations for training, which are expensive and time-consuming to obtain; this limits their performance when only limited annotated data is available. In contrast, a large corpus of in-domain unannotated videos can easily be collected from the internet. This paper therefore proposes an approach to temporal action segmentation that simultaneously leverages knowledge from annotated and unannotated video sequences. Our approach uses multi-stream distillation, which repeatedly refines and finally combines the streams' frame predictions. Our model also predicts the action order, which is then used as a temporal constraint while estimating frame labels, countering the lack of supervision for unannotated videos. Evaluation of the proposed approach on two different datasets demonstrates that it achieves performance comparable to full supervision despite limited annotation.
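To make the two mechanisms named above concrete, the following is a minimal Python sketch of (a) fusing frame predictions from multiple streams and (b) using a predicted action order as a temporal constraint while estimating frame labels. The function names, the simple averaging fusion, and the monotonic dynamic-programming alignment are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_streams(stream_probs):
    """Fuse per-frame class probabilities from multiple streams.

    stream_probs: list of (T, C) arrays, one per stream.
    A plain average is used here as a stand-in; the paper's
    distillation refines the streams before combining them.
    """
    return np.mean(np.stack(stream_probs, axis=0), axis=0)

def order_constrained_decode(probs, transcript):
    """Decode frame labels so they follow a given action order.

    probs: (T, C) fused frame class probabilities.
    transcript: predicted ordered list of action indices, e.g. [2, 0, 5].
    Returns a length-T label sequence that visits the transcript
    actions in order, maximizing the total frame log-probability
    (a standard monotonic dynamic-programming alignment; assumes T >= len(transcript)).
    """
    T, _ = probs.shape
    N = len(transcript)
    log_p = np.log(probs + 1e-12)[:, transcript]  # (T, N)
    dp = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)  # 0 = stay in segment, 1 = advance to next
    dp[0, 0] = log_p[0, 0]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = dp[t - 1, n]
            advance = dp[t - 1, n - 1] if n > 0 else -np.inf
            if advance > stay:
                dp[t, n], back[t, n] = advance + log_p[t, n], 1
            else:
                dp[t, n], back[t, n] = stay + log_p[t, n], 0
    # Backtrack from the last frame at the last transcript entry.
    labels, n = [], N - 1
    for t in range(T - 1, -1, -1):
        labels.append(transcript[n])
        n -= back[t, n]
    return labels[::-1]

# Hypothetical usage: 3 streams, 8 frames, 6 classes, predicted order [2, 0, 5].
rng = np.random.default_rng(0)
streams = [rng.dirichlet(np.ones(6), size=8) for _ in range(3)]
print(order_constrained_decode(fuse_streams(streams), [2, 0, 5]))
```

The alignment forces the decoded labels to form contiguous segments in the predicted order, which is one common way an action-order prediction can substitute for frame supervision on unannotated videos.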