Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in a coherent storytelling manner. Following the human perception process, where a scene is effectively understood by decomposing it into visual (e.g., humans, animals) and non-visual components (e.g., actions, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in terms of accuracy and diversity. Source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
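To make the VL contrastive objective concrete, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between pooled visual-linguistic event embeddings and their paired caption embeddings. This is a minimal illustration under assumed shapes and a hypothetical `temperature` hyperparameter, not the exact formulation used in VLTinT.

```python
import torch
import torch.nn.functional as F


def vl_contrastive_loss(video_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between event-level VL embeddings and captions.

    video_emb: (B, D) pooled visual-linguistic features, one per event (assumed shape)
    text_emb:  (B, D) caption embeddings paired row-wise with video_emb (assumed shape)
    """
    # L2-normalize both sets so the dot products below are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries correspond to matched pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Cross-entropy in both directions (video-to-text and text-to-video).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

In this kind of objective, matched video-caption pairs are pulled together in the shared embedding space while mismatched pairs in the batch are pushed apart, which is the general mechanism for aligning learned embeddings with caption semantics.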