UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog

Visual Dialog aims to answer multi-round, interactive questions based on the dialog history and image content. Existing methods either treat answer ranking and answer generation as separate problems, or capture the relation between the two tasks only weakly and implicitly through two separate models. A universal framework that jointly learns to rank and generate answers in a single model has seldom been explored. In this paper, we propose UTC, a contrastive learning-based framework that unifies and facilitates both the discriminative and the generative task in visual dialog with a single model. Specifically, considering the inherent limitation of the previous learning paradigm, we devise two inter-task contrastive losses, i.e., a context contrastive loss and an answer contrastive loss, so that the discriminative and generative tasks reinforce each other. These two complementary contrastive losses use the dialog context and the target answer, respectively, as anchor points to provide representation learning signals from different perspectives. We evaluate UTC on the VisDial v1.0 dataset, where our method outperforms the state of the art on both discriminative and generative tasks and surpasses previous state-of-the-art generative methods by more than 2 absolute points on Recall@1.

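The abstract names a context contrastive loss and an answer contrastive loss but does not give their formulas. As a rough illustration of what an anchor-based contrastive loss looks like, the sketch below implements a generic InfoNCE-style loss in plain Python. It is not the formulation used by UTC: the function name `info_nce`, the temperature value, the toy vectors, and the way positives and negatives are paired with each anchor are all assumptions made for this example.

```python
import math


def _dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    """Scale a non-zero vector to unit length (cosine similarity setup)."""
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (hypothetical helper).

    Returns -log( exp(s_pos) / (exp(s_pos) + sum_k exp(s_neg_k)) ),
    where s = cosine similarity / temperature.
    """
    a = _normalize(anchor)
    logits = [_dot(a, _normalize(positive)) / temperature]
    logits += [_dot(a, _normalize(n)) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy embeddings (made up for illustration, not from the paper).
context = [1.0, 0.2]            # dialog context representation
target_answer = [0.9, 0.1]      # representation of the correct answer
other_answers = [[0.0, 1.0], [-1.0, 0.3]]   # assumed negative answers
other_contexts = [[0.1, 1.0], [-0.8, 0.5]]  # assumed negative contexts

# Assumed pairing: the context anchors one loss, the answer anchors the other.
context_loss = info_nce(context, target_answer, other_answers)
answer_loss = info_nce(target_answer, context, other_contexts)
total_loss = context_loss + answer_loss
```

Using both the context and the target answer as anchors yields two training signals over the same pair of representations, which is one plausible reading of the "different perspectives" mentioned in the abstract; the exact construction of positives and negatives in UTC should be taken from the paper itself.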