Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Visual entailment with natural language explanations aims to infer the relationship between a text-image pair and to generate a sentence that explains the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image and ignore the high-level semantic alignment between phrases (chunks) and visual content, which is critical for vision-language reasoning. Moreover, an explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference, so the generated explanations are less faithful to the vision-language reasoning. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence structure inherent in language and various image regions to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG uses lexical constraints to explicitly incorporate the words or chunks that the relation inferrer focuses on into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets, and the results indicate that CALeC significantly outperforms competing models in both inference accuracy and the quality of generated explanations.
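
The chunk-level idea described above can be illustrated with a toy sketch. This is not the CALeC implementation: the mean pooling, the dot-product attention, the "peak attention weight" score standing in for the relation inferrer's focus, and every function name and vector below are assumptions made purely for illustration.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed chunk pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def chunk_region_alignment(token_vecs, chunk_spans, region_vecs):
    """Pool the tokens of each (start, end) chunk span, then attend over image regions.

    Returns one (chunk_vector, attention_weights) pair per chunk.
    """
    aligned = []
    for start, end in chunk_spans:
        chunk_vec = mean_pool(token_vecs[start:end])
        weights = softmax([dot(chunk_vec, r) for r in region_vecs])
        aligned.append((chunk_vec, weights))
    return aligned

def select_constraints(tokens, chunk_spans, chunk_scores, k=1):
    """Return the k highest-scoring chunks as lexical constraints (hypothetical helper)."""
    ranked = sorted(range(len(chunk_spans)), key=lambda i: chunk_scores[i], reverse=True)
    return [" ".join(tokens[chunk_spans[i][0]:chunk_spans[i][1]]) for i in ranked[:k]]

# Toy inputs: hand-made 2-d "embeddings", not real model outputs.
tokens = ["a", "man", "rides", "a", "horse"]
token_vecs = [[0.1, 0.0], [2.0, 0.0], [0.5, 0.5], [0.0, 0.1], [0.0, 3.0]]
chunk_spans = [(0, 2), (2, 3), (3, 5)]    # "a man", "rides", "a horse"
region_vecs = [[1.0, 0.0], [0.0, 1.0]]    # two image regions

aligned = chunk_region_alignment(token_vecs, chunk_spans, region_vecs)
# Assumed stand-in for the inferrer's focus: how peaked each chunk's attention is.
chunk_scores = [max(weights) for _, weights in aligned]
constraints = select_constraints(tokens, chunk_spans, chunk_scores, k=2)
print(constraints)
```

In a full system, the selected chunks would be passed to a constrained decoder so that they are required to appear in the generated explanation; that decoding step is omitted here.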
