Significant progress has been made on visual captioning, largely relying on pre-trained features and, more recently, fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, because the object detector models are frozen, their outputs lack sufficient richness for the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and of the importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.
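To make the CLIP-based retrieval step concrete, the following is a minimal sketch (not the authors' implementation) of how attribute/relationship phrases could be ranked against an image with a pre-trained CLIP model and the top-scoring ones kept as auxiliary context for the captioning model. The candidate phrases, image path, and top-k value are hypothetical placeholders; the sketch assumes the Hugging Face `transformers` CLIP interface.

```python
# Sketch: retrieve contextual descriptions for an image with CLIP.
# Candidates stand in for attribute/relationship phrases mined from Visual Genome.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

# Hypothetical mined descriptions (placeholders).
candidates = [
    "a man riding a brown horse",
    "a dog sitting on green grass",
    "a red umbrella held by a woman",
]

image = Image.open("example.jpg")  # placeholder image path

with torch.no_grad():
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    # logits_per_image: image-text similarity scores, shape (1, num_candidates).
    scores = out.logits_per_image.squeeze(0)

top_k = 2
best = scores.topk(top_k).indices.tolist()
context = [candidates[i] for i in best]
# `context` would then be provided, alongside the frozen detector's outputs,
# as the auxiliary input to the auto-regressive captioning model.
print(context)
```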