Visual Commonsense-aware Representation Network for Video Captioning

Generating consecutive descriptions for videos, i.e., video captioning, requires taking full advantage of the visual representation throughout the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to produce inferences. However, such methods exploit only the superficial associations contained in the video itself and do not consider the intrinsic visual commonsense knowledge that exists in a video dataset, which may hinder their ability to reason toward accurate descriptions. To address this problem, we propose a simple yet effective method for video captioning, called Visual Commonsense-aware Representation Network (VCRN). Specifically, we construct a Video Dictionary, a plug-and-play component obtained by clustering all video features from the entire dataset into multiple cluster centers without additional annotation. Each center implicitly represents a visual commonsense concept in the video domain and is used by our proposed Visual Concept Selection (VCS) to obtain a video-related concept feature. Next, a Conceptual Integration Generation (CIG) is proposed to enhance caption generation. Extensive experiments on three public video captioning benchmarks, MSVD, MSR-VTT, and VATEX, demonstrate that our method reaches state-of-the-art performance, indicating its effectiveness. In addition, our approach is integrated into an existing video question answering method and improves its performance, further showing the generalization ability of our method. Source code has been released at https://github.com/zchoi/VCRN.
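The two components described above, a dictionary of cluster centers and a selection step that turns it into a video-related concept feature, can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: plain k-means stands in for the clustering step, softmax attention over scaled dot-product similarity stands in for VCS, and the names `build_video_dictionary`, `select_concepts`, `num_concepts`, and `temperature` are hypothetical. The released code at the URL above may do this differently.

```python
import numpy as np


def build_video_dictionary(features, num_concepts, num_iters=20, seed=0):
    """Cluster video features of shape (N, D) into num_concepts centers.

    Assumption: plain k-means; the abstract only says "clustering".
    """
    rng = np.random.default_rng(seed)
    # Initialise the centers from distinct, randomly chosen feature vectors.
    init_idx = rng.choice(len(features), size=num_concepts, replace=False)
    centers = features[init_idx].astype(float)
    for _ in range(num_iters):
        # Squared Euclidean distance from every feature to every center: (N, K).
        diffs = features[:, None, :] - centers[None, :, :]
        dists = (diffs ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_concepts):
            members = features[assign == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers


def select_concepts(video_feature, dictionary, temperature=1.0):
    """Return a concept feature as an attention-weighted sum of the centers.

    Assumption: softmax over scaled dot-product similarity; this is an
    illustrative stand-in for VCS, not the paper's definition.
    """
    scale = temperature * np.sqrt(dictionary.shape[1])
    scores = dictionary @ video_feature / scale
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    concept_feature = weights @ dictionary
    return concept_feature, weights


# Toy usage with random vectors standing in for pooled video features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))
dictionary = build_video_dictionary(features, num_concepts=8)
concept_feature, weights = select_concepts(features[0], dictionary)
```

Because the dictionary is built once from unlabeled features and then only read, a component like this can be attached to an existing captioning model without extra annotation, which is the plug-and-play property the abstract emphasizes.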
