Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information through large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, they tend to suffer from noisy labels and sparse semantic annotations, because their visual features come from a detector pretrained on a crowdsourced dataset with limited and inconsistent semantic labeling. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces region regression and classification with cross-modality region contrastive learning that requires no annotations. Two data augmentation strategies (Mask Perturbation and Intra-/Inter-Adversarial Perturbation) are developed to improve the quality of the negative samples used in contrastive learning. Overall, DCVLP enables cross-modality dense region contrastive learning in a self-supervised setting, independent of any object annotations. We compare our method against prior visual-linguistic pretraining frameworks to validate the superiority of dense contrastive learning for multimodal representation learning.
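To make the region contrastive objective concrete, the following is a minimal, hypothetical PyTorch sketch of InfoNCE-style contrastive learning over region features, with Mask Perturbation used to generate a positive view. All names here (`mask_perturb`, `info_nce_loss`, the stand-in encoder) are illustrative assumptions, not the authors' implementation; details such as the negative-sampling scheme and the adversarial perturbation variant are omitted.

```python
# Hypothetical sketch of cross-modality region contrastive learning with
# mask perturbation, assuming PyTorch. Not the authors' actual code.
import torch
import torch.nn.functional as F

def mask_perturb(regions: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    """Randomly zero out a fraction of region features to form an augmented view."""
    keep = torch.rand(regions.shape[:2], device=regions.device) > mask_ratio
    return regions * keep.unsqueeze(-1)

def info_nce_loss(queries: torch.Tensor, keys: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over regions: each region's perturbed counterpart is the
    positive; all other regions in the batch serve as negatives."""
    q = F.normalize(queries.flatten(0, 1), dim=-1)   # (B*R, D)
    k = F.normalize(keys.flatten(0, 1), dim=-1)      # (B*R, D)
    logits = q @ k.t() / temperature                 # (B*R, B*R) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)

# Usage: region features would come from a (frozen) detector; the encoder
# stands in for the cross-modality transformer producing contextualized regions.
B, R, D = 8, 36, 768                        # batch size, regions per image, feature dim
regions = torch.randn(B, R, D)              # placeholder detector region features
encoder = torch.nn.Linear(D, D)             # stand-in for the cross-modal encoder
queries = encoder(regions)                  # clean view
keys = encoder(mask_perturb(regions))       # mask-perturbed positive view
loss = info_nce_loss(queries, keys)
loss.backward()
```

Note that this objective needs no region-level labels: supervision comes entirely from matching a region to its own perturbed view, which is why the approach sidesteps the noisy-label problem of regression and classification pretext tasks.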