This paper focuses on weakly-supervised action alignment, where only the ordered sequence of video-level actions is available for training. We propose a novel Duration Network, which captures a short temporal window of the video and learns to predict the remaining duration of a given action at any point in time, with a level of granularity that depends on the type of that action. Further, we introduce a Segment-Level Beam Search to obtain the alignment that maximizes our posterior probability. Segment-Level Beam Search efficiently aligns actions by considering only a selected set of frames with more confident predictions. The experimental results show that our alignments for long videos are more robust than those of existing models. Moreover, the proposed method achieves state-of-the-art results in certain cases on the popular Breakfast and Hollywood Extended datasets.
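To make the segment-level search idea concrete, the following is a minimal sketch in Python, assuming per-frame action probabilities from a frame-wise classifier. The pruning heuristic (ranking candidate segment boundaries by the confidence of the next action) and the parameter names `beam_width` and `top_frames` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def segment_level_beam_search(frame_probs, transcript, beam_width=5, top_frames=20):
    """Align an ordered transcript of actions to T frames (illustrative sketch).

    frame_probs: (T, C) array of per-frame action probabilities.
    transcript:  ordered list of action indices (one per segment).
    Returns the best (log_score, boundaries) pair, where boundaries[i]
    is the first frame of segment i and boundaries[-1] == T.
    """
    T = frame_probs.shape[0]
    log_probs = np.log(frame_probs + 1e-10)

    # Each beam is (log_score, boundaries); boundaries[0] is always 0.
    beams = [(0.0, [0])]
    for i, action in enumerate(transcript[:-1]):
        next_action = transcript[i + 1]
        # Every remaining action still needs at least one frame.
        max_end = T - (len(transcript) - i - 1)
        new_beams = []
        for score, bounds in beams:
            start = bounds[-1]
            if start >= max_end:
                continue
            candidates = np.arange(start + 1, max_end + 1)
            # Heuristic pruning: keep boundaries where the next action is most confident.
            conf = log_probs[candidates, next_action]
            if len(candidates) > top_frames:
                candidates = candidates[np.argsort(-conf)[:top_frames]]
            for end in candidates:
                seg_score = log_probs[start:end, action].sum()
                new_beams.append((score + seg_score, bounds + [int(end)]))
        # Keep only the beam_width best partial alignments.
        new_beams.sort(key=lambda b: -b[0])
        beams = new_beams[:beam_width]

    # Close the last segment at the final frame and score it.
    last = transcript[-1]
    finished = [(s + log_probs[b[-1]:T, last].sum(), b + [T]) for s, b in beams]
    return max(finished, key=lambda b: b[0])
```

In this sketch, keeping only `beam_width` partial alignments and `top_frames` candidate boundaries per action bounds the branching factor, which is what keeps the search tractable on long videos compared with scoring every possible frame-level boundary.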