Video captioning aims to describe the events in a video with natural language. In recent years, many works have focused on improving captioning models' performance. However, like other text generation tasks, video captioning risks introducing factual errors that are not supported by the input video. Such factual errors can seriously degrade the quality of the generated text, sometimes rendering it completely unusable. Although factual consistency has received substantial research attention in text-to-text tasks (e.g., summarization), it is less studied in the context of vision-based text generation. In this work, we conduct a detailed human evaluation of factuality in video captioning and collect two annotated factuality datasets. We find that 57.0% of the model-generated sentences contain factual errors, indicating that this is a severe problem in the field. However, existing evaluation metrics are mainly based on n-gram matching and show little correlation with human factuality annotations. We further propose a weakly-supervised, model-based factuality metric, FactVC, which outperforms previous metrics on factuality evaluation of video captioning. The datasets and metrics will be released to promote future research on video captioning.