Our objective in this work is fine-grained classification of actions in untrimmed videos, where the actions may be temporally extended or may span only a few frames of the video. We cast this as a query-response mechanism, where each query addresses a particular question and has its own response label set. We make the following four contributions: (i) We propose a new model, the Temporal Query Network (TQN), which enables the query-response functionality and a structural understanding of fine-grained actions. It attends to relevant segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query. (ii) We propose a new training method, stochastic feature bank update, which allows a network to be trained on videos of various lengths with the dense sampling required to respond to fine-grained queries. (iii) We compare the TQN to other architectures and text supervision methods, and analyze their pros and cons. Finally, (iv) we evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
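The query-response idea can be illustrated with a minimal sketch. It is not the TQN architecture itself: it assumes a single dot-product attention step as a stand-in for the temporal attention mechanism, and the names `query_responses`, `queries`, `classifiers` and `label_set_sizes`, along with all dimensions, are hypothetical. The point it shows is that each query attends over the temporal features independently and is classified into its own label set.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_responses(features, queries, classifiers):
    """Hypothetical single-step query attention (illustrative only).

    features:    (T, D) per-segment visual features of one video
    queries:     (K, D) one vector per query (learned in a real model)
    classifiers: list of K arrays, classifiers[k] has shape (D, C_k),
                 so each query has its own response label set
    """
    d = features.shape[1]
    scores = queries @ features.T / np.sqrt(d)  # (K, T) query-segment similarity
    attn = softmax(scores, axis=1)              # each query's weights over time
    pooled = attn @ features                    # (K, D) attended feature per query
    logits = [pooled[k] @ classifiers[k] for k in range(len(classifiers))]
    return attn, logits

# Random stand-ins for features and parameters (assumed sizes).
rng = np.random.default_rng(0)
T, D = 50, 16
label_set_sizes = [3, 5, 2]  # a different number of responses per query
features = rng.normal(size=(T, D))
queries = rng.normal(size=(len(label_set_sizes), D))
classifiers = [rng.normal(size=(D, c)) for c in label_set_sizes]

attn, logits = query_responses(features, queries, classifiers)
```

In this sketch, training with only per-query labels would amount to summing one cross-entropy loss per entry of `logits`; no temporal annotation of where each query should attend is needed.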