The video action segmentation task is regularly explored under weaker forms of supervision, such as transcript supervision, where an ordered list of actions is far easier to obtain than dense frame-wise labels. In this formulation, the task challenges sequence modeling approaches through its emphasis on action transition points, long sequence lengths, and frame contextualization, making it well suited to transformers. Building on developments that enable transformers to scale linearly, we demonstrate through our architecture how they can improve action alignment accuracy over equivalent RNN-based models, with the attention mechanism focusing on salient action transition regions. Additionally, given the recent focus on inference-time transcript selection, we propose a supplemental transcript embedding approach that selects transcripts more quickly at inference time, and we subsequently demonstrate that this approach also improves overall segmentation performance. Finally, we evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers and the importance of transcript selection for this weakly supervised, video-driven task.
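The embedding-based transcript selection mentioned above can be illustrated with a minimal sketch: embed the test video and each candidate transcript in a shared space, then shortlist the nearest candidates by cosine similarity before running the (more expensive) full alignment. The function name `select_transcripts` and the use of plain cosine similarity are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def select_transcripts(video_emb, transcript_embs, k=3):
    """Return indices of the k candidate transcripts closest to the video.

    video_emb:       (d,) embedding of the test video (hypothetical).
    transcript_embs: (n, d) embeddings of the n candidate transcripts.
    """
    # Normalize so the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb)
    t = transcript_embs / np.linalg.norm(transcript_embs, axis=1, keepdims=True)
    sims = t @ v
    # Indices of the k most similar candidates, best first; only these
    # would be passed on to the full alignment stage.
    return np.argsort(-sims)[:k]
```

Only the shortlisted transcripts are then aligned frame-by-frame, which is where the inference-time speedup comes from.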