Grounded video description (GVD) encourages captioning models to dynamically attend to appropriate video regions (e.g., objects) while generating a description. Such a setting helps explain the decisions of captioning models and prevents them from hallucinating object words in their descriptions. However, this design focuses mainly on object word generation and may therefore ignore fine-grained information and suffer from missing visual concepts. Moreover, relational words (e.g., "jump left or right") are usually the result of spatio-temporal inference, i.e., they cannot be grounded in specific spatial regions. To tackle these limitations, we design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is constructed to explore fine-grained visual concepts. Furthermore, the refined graph serves as relational inductive knowledge that assists the captioning model in selecting the relevant information it needs to generate correct words. We validate the effectiveness of our model through automatic metrics and human evaluation; the results indicate that our approach generates more fine-grained and accurate descriptions and alleviates the problem of object hallucination to some extent.
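To make the idea of using a refined scene graph as relational inductive knowledge concrete, the following is a minimal, illustrative sketch (not the paper's implementation): a caption decoder attends over scene-graph node embeddings (objects and relations) at every step and conditions word prediction on the attended graph context. All module names, dimensions, and the additive-attention formulation here are assumptions introduced for illustration only.

```python
# Hypothetical sketch of a graph-conditioned caption decoder (PyTorch).
# The scene-graph nodes stand in for the "relational inductive knowledge";
# nothing here is taken from the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConditionedDecoder(nn.Module):
    def __init__(self, vocab_size, node_dim=512, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + node_dim, hidden_dim)
        # additive attention over scene-graph node embeddings
        self.att_node = nn.Linear(node_dim, hidden_dim, bias=False)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.att_score = nn.Linear(hidden_dim, 1, bias=False)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, h, nodes):
        # nodes: (batch, num_nodes, node_dim); h: (batch, hidden_dim)
        scores = self.att_score(
            torch.tanh(self.att_node(nodes) + self.att_hidden(h).unsqueeze(1))
        )
        alpha = F.softmax(scores, dim=1)          # attention weights per node
        return (alpha * nodes).sum(dim=1), alpha  # graph context vector

    def forward(self, nodes, captions):
        # captions: (batch, seq_len) word ids, teacher-forced during training
        batch = captions.size(0)
        h = captions.new_zeros(batch, self.lstm.hidden_size, dtype=torch.float)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            ctx, _ = self.attend(h, nodes)
            x = torch.cat([self.word_embed(captions[:, t]), ctx], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)         # (batch, seq_len, vocab)


# usage: 2 videos, 12 scene-graph nodes each, vocabulary of 1000 words
decoder = GraphConditionedDecoder(vocab_size=1000)
nodes = torch.randn(2, 12, 512)
caps = torch.randint(0, 1000, (2, 8))
print(decoder(nodes, caps).shape)  # torch.Size([2, 8, 1000])
```

In this sketch the attention weights over graph nodes play the role of the grounding signal: relational nodes let the decoder condition relational words on graph structure rather than on any single spatial region, which is the motivation stated in the abstract.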