Video captioning aims to understand the spatio-temporal semantic concepts of a video and generate descriptive sentences. The de-facto approach to this task dictates a text generator to learn from \textit{offline-extracted} motion or appearance features produced by \textit{pre-trained} vision models. However, these methods may suffer from the so-called \textbf{\textit{``couple''}} drawbacks in both \textit{video spatio-temporal representation} and \textit{sentence generation}. For the former, \textbf{\textit{``couple''}} means learning the spatio-temporal representation within a single model (3D CNN), which leads to a \emph{disconnection between the pre-training and downstream task domains} and makes \emph{end-to-end training difficult}. For the latter, \textbf{\textit{``couple''}} means treating the generation of visual semantic words and syntax-related words equally. To this end, we present $\mathcal{D}^{2}$ - a dual-level decoupled transformer pipeline that addresses these drawbacks: \emph{(i)} for video spatio-temporal representation, we decouple the process into a ``first-spatial-then-temporal'' paradigm, which releases the potential of dedicated models (\textit{e.g.,} image-text pre-trained models) to bridge the pre-training and downstream tasks, and makes the entire model end-to-end trainable; \emph{(ii)} for sentence generation, we propose a \emph{Syntax-Aware Decoder} that dynamically measures the contributions of visual semantic and syntax-related words. Extensive experiments on three widely-used benchmarks (MSVD, MSR-VTT and VATEX) demonstrate the great potential of the proposed $\mathcal{D}^{2}$, which surpasses previous methods by a large margin on the task of video captioning.
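To make the ``first-spatial-then-temporal'' decoupling concrete, the following is a minimal PyTorch-style sketch (not the authors' code): each frame is first encoded independently by a 2D spatial backbone, and only then is a lightweight temporal transformer applied over the per-frame embeddings. All module names and sizes are illustrative assumptions; the tiny convolutional stem stands in for an image-text pre-trained encoder such as CLIP's image tower.

\begin{verbatim}
# Minimal sketch of "first-spatial-then-temporal" encoding (assumed design).
import torch
import torch.nn as nn


class DecoupledVideoEncoder(nn.Module):
    def __init__(self, dim=512, num_frames=16, depth=2, heads=8):
        super().__init__()
        # Spatial stage: encode each frame independently with a 2D backbone.
        # A tiny conv stem replaces the image-text pre-trained model here.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),   # patchify
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Temporal stage: a small transformer over per-frame embeddings.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.time_pos = nn.Parameter(torch.zeros(1, num_frames, dim))

    def forward(self, video):                        # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                 # (B*T, 3, H, W)
        feats = self.spatial(frames).view(b, t, -1)  # spatial first: (B, T, dim)
        feats = feats + self.time_pos[:, :t]
        return self.temporal(feats)                  # temporal second: (B, T, dim)


if __name__ == "__main__":
    enc = DecoupledVideoEncoder()
    clip = torch.randn(2, 16, 3, 224, 224)           # two 16-frame clips
    print(enc(clip).shape)                           # torch.Size([2, 16, 512])
\end{verbatim}

Because the spatial stage is just an image encoder applied frame by frame, it can directly reuse image-text pre-trained weights and be fine-tuned end-to-end together with the temporal stage and the caption decoder; the syntax-aware gating of the decoder is not shown in this sketch.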