The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs standard likelihood training for answer generation on positive instances (involving correct answers). However, the likelihood objective often leads to frequent and dull outputs and fails to exploit the useful knowledge in negative instances (involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation through multi-task learning, our model extends ViLBERT from supporting only answer discrimination to supporting both answer discrimination and answer generation seamlessly via different attention masks. Specifically, to make the original discriminative model compatible with answer generation, we design novel generative attention masks to implement the autoregressive Masked Language Modeling (autoregressive MLM) task. To attenuate the adverse effects of the likelihood objective, we exploit unlikelihood training on negative instances so that the model becomes less likely to generate incorrect answers. Then, to utilize dense annotations, we adopt different fine-tuning methods for both generating and discriminating answers, rather than only for discriminating answers as in prior work. Finally, on the VisDial dataset, our model achieves the best generative results (69.23 NDCG score), and it also yields discriminative results comparable to the state of the art in both single-model and ensemble settings (75.92 and 76.17 NDCG scores).
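For context, a minimal sketch of how an unlikelihood term is typically combined with the autoregressive likelihood objective, following the standard formulation of unlikelihood training; the exact losses, the weight $\alpha$, and the symbols $a$, $a^{-}$, $v$, $h$ below are illustrative assumptions rather than the paper's definitions:

\[
\mathcal{L}_{\mathrm{MLE}} = -\sum_{t}\log p_\theta\!\left(a_t \mid a_{<t}, v, h\right),
\qquad
\mathcal{L}_{\mathrm{UL}} = -\sum_{t}\log\!\left(1 - p_\theta\!\left(a^{-}_{t} \mid a^{-}_{<t}, v, h\right)\right),
\]
\[
\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \alpha\,\mathcal{L}_{\mathrm{UL}},
\]

where $a$ is the correct answer of a positive instance, $a^{-}$ an incorrect answer from a negative instance, $v$ the image, $h$ the dialog history, and $\alpha$ a weighting hyperparameter. The likelihood term rewards the correct answer tokens, while the unlikelihood term penalizes probability mass placed on incorrect answer tokens.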