We present Knowledge Enhanced Multimodal BART (KM-BART), a Transformer-based sequence-to-sequence model capable of reasoning about commonsense knowledge from multimodal inputs of images and texts. We adapt the generative BART architecture into a multimodal model with visual and textual inputs. We further develop novel pretraining tasks to improve model performance on the Visual Commonsense Generation (VCG) task. In particular, our pretraining task of Knowledge-based Commonsense Generation (KCG) boosts performance on the VCG task by leveraging commonsense knowledge from a large language model pretrained on external commonsense knowledge graphs. To the best of our knowledge, we are the first to propose a dedicated pretraining task for improving model performance on the VCG task. Experimental results show that, by applying these novel pretraining tasks, our model reaches state-of-the-art performance on the VCG task.
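To make the architectural adaptation concrete, the sketch below illustrates one generic way a BART-style encoder-decoder can be extended to accept visual and textual inputs: image region features are projected into the token embedding space and concatenated with the text embeddings before the encoder. This is a minimal illustration, not the authors' implementation; the class name, dimensions, and the use of a plain `nn.Transformer` backbone are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a multimodal seq2seq model: visual region features
# are mapped to the text embedding dimension and concatenated with token
# embeddings as the encoder input. Dimensions are illustrative, not KM-BART's.
class MultimodalSeq2SeqSketch(nn.Module):
    def __init__(self, vocab_size=50265, d_model=768, visual_dim=2048):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)  # project image features to d_model
        self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, input_ids, decoder_input_ids):
        # visual_feats: (batch, num_regions, visual_dim), e.g. detector region features
        # input_ids, decoder_input_ids: (batch, seq_len) token indices
        vis = self.visual_proj(visual_feats)
        txt = self.token_embed(input_ids)
        encoder_inputs = torch.cat([vis, txt], dim=1)        # fuse modalities along the sequence axis
        decoder_inputs = self.token_embed(decoder_input_ids)
        hidden = self.seq2seq(encoder_inputs, decoder_inputs)
        return self.lm_head(hidden)                          # logits over the vocabulary

# Example forward pass with random inputs.
model = MultimodalSeq2SeqSketch()
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 50265, (2, 16)),
               torch.randint(0, 50265, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 50265])
```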