Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
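To make the single-stream vs. dual-stream distinction concrete, the following is a minimal sketch in PyTorch, not the paper's implementation: the class names, hidden size, and single-layer depth are illustrative assumptions. A single-stream layer concatenates text and visual embeddings and runs joint self-attention; a dual-stream layer keeps the two modalities separate and exchanges information through cross-attention.

```python
# Minimal sketch of the two V&L encoder families (illustrative, not the paper's code).
import torch
import torch.nn as nn


class SingleStreamLayer(nn.Module):
    """Text and visual embeddings are concatenated and share one self-attention."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        x = torch.cat([txt, img], dim=1)      # (B, T+V, dim)
        out, _ = self.attn(x, x, x)           # joint self-attention over both modalities
        return out


class DualStreamLayer(nn.Module):
    """Each modality keeps its own stream; information flows via cross-attention."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        txt_out, _ = self.txt_cross(txt, img, img)  # text queries attend to visual keys/values
        img_out, _ = self.img_cross(img, txt, txt)  # visual queries attend to text keys/values
        return txt_out, img_out


# Usage: random tensors standing in for token and region embeddings.
txt = torch.randn(2, 20, 768)   # (batch, text tokens, hidden)
img = torch.randn(2, 36, 768)   # (batch, visual regions, hidden)
print(SingleStreamLayer()(txt, img).shape)            # torch.Size([2, 56, 768])
print([t.shape for t in DualStreamLayer()(txt, img)]) # two streams, shapes preserved
```

Under this framing, a single-stream encoder is the special case where the two streams share all attention parameters and attend over the concatenated sequence, which is what allows both families to be expressed in one unified framework.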