With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite these performance gains, video captioning models are prone to hallucination. Hallucination refers to the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object hallucination and action hallucination. Rather than endeavoring to learn better video representations, in this work we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influence of the source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) the introduction of auxiliary heads trained in a multi-label setting on top of the extracted visual features and (b) the addition of context gates, which dynamically select features during fusion. The standard evaluation metrics for video captioning measure similarity with ground-truth captions and do not adequately capture object and action relevance. To this end, we propose a new metric, COAHA (caption object and action hallucination assessment), which assesses the degree of hallucination. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets, with a particularly large margin in CIDEr score.
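To make the two proposed solutions concrete, the following is a minimal sketch of what an auxiliary multi-label head over pre-extracted visual features and a context gate over the source (visual) and target (language) contexts could look like. Layer sizes, pooling, the tag vocabulary, and the exact fusion form are assumptions for illustration, not the paper's precise architecture.

```python
import torch
import torch.nn as nn

class AuxiliaryMultiLabelHead(nn.Module):
    """Sketch of solution (a): a multi-label classifier on top of the extracted
    visual features, trained with BCE against per-video object/action tags.
    The tag vocabulary size and mean-pooling are illustrative assumptions."""
    def __init__(self, feature_dim: int, num_tags: int):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_tags)

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_frames, feature_dim)
        pooled = visual_features.mean(dim=1)   # (batch, feature_dim)
        return self.classifier(pooled)         # (batch, num_tags) logits


class ContextGate(nn.Module):
    """Sketch of solution (b): a sigmoid gate that re-weights the visual
    (source) and language (target) contexts before fusion at each decoding step."""
    def __init__(self, visual_dim: int, lang_dim: int, hidden_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)
        self.gate = nn.Linear(visual_dim + lang_dim, hidden_dim)

    def forward(self, visual_ctx: torch.Tensor, lang_ctx: torch.Tensor) -> torch.Tensor:
        # g in (0, 1): per-dimension weight on the visual vs. language context.
        g = torch.sigmoid(self.gate(torch.cat([visual_ctx, lang_ctx], dim=-1)))
        return g * torch.tanh(self.visual_proj(visual_ctx)) + \
               (1.0 - g) * torch.tanh(self.lang_proj(lang_ctx))


# Illustrative auxiliary loss, added to the usual captioning loss:
# aux_head = AuxiliaryMultiLabelHead(feature_dim=2048, num_tags=400)
# loss_aux = nn.BCEWithLogitsLoss()(aux_head(features), tag_targets)
```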
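The abstract does not give the COAHA formula, but a toy stand-in conveys the idea: count how many generated objects and actions are absent from the reference terms. The function names, the equal object/action weighting, and the set-based matching below are assumptions; the real COAHA may weight terms differently (e.g., by semantic similarity or frequency).

```python
from typing import Set

def hallucination_rate(pred_terms: Set[str], ref_terms: Set[str]) -> float:
    """Fraction of generated terms that do not appear in the reference set."""
    if not pred_terms:
        return 0.0
    return len(pred_terms - ref_terms) / len(pred_terms)

def coaha_like_score(pred_objects: Set[str], pred_actions: Set[str],
                     ref_objects: Set[str], ref_actions: Set[str]) -> float:
    """Illustrative combined score: average of object and action hallucination
    rates (lower is better). A toy stand-in, not the paper's exact metric."""
    return 0.5 * (hallucination_rate(pred_objects, ref_objects)
                  + hallucination_rate(pred_actions, ref_actions))

# Hypothetical usage: one hallucinated object ("ball"), no hallucinated action.
# coaha_like_score({"dog", "ball"}, {"running"}, {"dog"}, {"running", "playing"})
# -> 0.25
```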