Multimodal semantic understanding often has to deal with uncertainty: the messages obtained tend to refer to multiple targets. Such uncertainty, which arises both across and within modalities, complicates interpretation. Little effort has been devoted to modeling this uncertainty, particularly during pre-training on unlabeled datasets and fine-tuning on task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) that exploits sequence-level interactions. Compared with existing deterministic methods, such uncertainty modeling conveys richer multimodal semantic information and captures more complex relationships. Furthermore, we integrate uncertainty modeling into popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.
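To make the idea concrete, below is a minimal PyTorch sketch of the kind of module the abstract describes: a PDE-style head that turns a modality's feature sequence into a diagonal Gaussian, plus a distribution-based contrastive objective in the spirit of D-VLC. This is an illustrative assumption, not the paper's exact implementation: the single attention layer, the choice of diagonal Gaussians, the 2-Wasserstein distance, and all names (`ProbabilityDistributionEncoder`, `d_vlc_loss`, etc.) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilityDistributionEncoder(nn.Module):
    """Illustrative PDE head (assumed design): maps a deterministic feature
    sequence to the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # A single self-attention layer stands in for the paper's
        # sequence-level interaction module; dim must divide num_heads.
        self.interact = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mu_head = nn.Linear(dim, dim)
        self.logvar_head = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, seq_len, dim) token/patch features of one modality
        ctx, _ = self.interact(feats, feats, feats)
        pooled = ctx.mean(dim=1)                      # (batch, dim)
        return self.mu_head(pooled), self.logvar_head(pooled)

def wasserstein2_sq(mu1, logvar1, mu2, logvar2):
    """Squared 2-Wasserstein distance between diagonal Gaussians; has a
    closed form, so no sampling is needed to compare distributions."""
    sigma1 = torch.exp(0.5 * logvar1)
    sigma2 = torch.exp(0.5 * logvar2)
    return ((mu1 - mu2) ** 2).sum(-1) + ((sigma1 - sigma2) ** 2).sum(-1)

def d_vlc_loss(img_mu, img_lv, txt_mu, txt_lv, temperature: float = 0.07):
    """Hypothetical D-VLC-style contrastive loss over distributions."""
    # Pairwise distances between every image and text distribution in the
    # batch; matched pairs sit on the diagonal, as in standard contrastive
    # learning, and smaller distance means higher similarity.
    dist = wasserstein2_sq(img_mu.unsqueeze(1), img_lv.unsqueeze(1),
                           txt_mu.unsqueeze(0), txt_lv.unsqueeze(0))
    logits = -dist / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

One motivation for a closed-form distance such as 2-Wasserstein between Gaussians is that the contrastive objective stays differentiable and cheap to evaluate, since no Monte Carlo sampling from the predicted distributions is required.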