Video paragraph captioning aims to generate a coherent, multi-sentence description of an untrimmed video that contains several temporal events. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., humans, animals) and non-visual components (e.g., actions, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in terms of accuracy and diversity.
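The abstract does not give the exact formulation of the VL contrastive loss, so the following is only a minimal, hypothetical sketch of how a contrastive objective between event embeddings and caption embeddings could be written in PyTorch (an InfoNCE-style symmetric loss). The function name `vl_contrastive_loss`, the temperature value, and the embedding shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vl_contrastive_loss(event_embeds: torch.Tensor,
                        caption_embeds: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical InfoNCE-style contrastive loss: each event embedding is
    paired with its ground-truth caption embedding, and all other captions in
    the batch act as negatives. Both inputs have shape (batch, dim)."""
    # L2-normalize so dot products become cosine similarities.
    event_embeds = F.normalize(event_embeds, dim=-1)
    caption_embeds = F.normalize(caption_embeds, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(event_i, caption_j).
    logits = event_embeds @ caption_embeds.t() / temperature

    # Matching event-caption pairs lie on the diagonal.
    targets = torch.arange(event_embeds.size(0), device=event_embeds.device)

    # Symmetric cross-entropy over both matching directions.
    loss_e2c = F.cross_entropy(logits, targets)
    loss_c2e = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_e2c + loss_c2e)

if __name__ == "__main__":
    # Toy usage with random 512-dimensional embeddings for 8 events.
    events = torch.randn(8, 512)
    captions = torch.randn(8, 512)
    print(vl_contrastive_loss(events, captions).item())
```

A symmetric two-direction loss is a common design choice for aligning two modalities; the actual VLTinT loss may differ in how negatives are sampled and how the embeddings are produced.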