Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-scale web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing them and exploring further. To this end, we propose a stack of novel methods that significantly reduces this heavy resource dependency and allows us to conduct dual-encoder multi-modal representation alignment with limited resources. In addition, we provide a reproducible baseline with competitive results, namely ZeroVL, trained on only 14M image-text pairs from publicly accessible academic datasets using 8 V100 GPUs. We further collect 100M web data for pre-training and achieve results comparable or superior to state-of-the-art methods, demonstrating the effectiveness of our methods on large-scale data as well. We hope that this work will provide useful data points and experience for future research on contrastive vision-language pre-training. Code is available at https://github.com/zerovl/ZeroVL.
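To make the dual-encoder contrastive alignment objective concrete, the sketch below shows a minimal CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings. This is only an illustrative sketch, not the ZeroVL implementation; the embedding dimension, temperature value, and function name are assumptions.

```python
# Minimal sketch of a dual-encoder contrastive loss (illustrative only;
# the embedding dimension and temperature are assumed, not taken from ZeroVL).
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: logits[i, j] = sim(image_i, text_j) / temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Example: a batch of 8 pairs with 512-d embeddings from the two encoders.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_alignment_loss(img, txt).item())
```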