Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevent researchers with limited resources from reproduction and further exploration. To this end, we explore a stack of simple but effective heuristics, and provide a comprehensive training guidance, which allows us to conduct dual-encoder multi-modal representation alignment with limited resources. We provide a reproducible strong baseline of competitive results, namely ZeroVL, with only 14M publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data for pre-training, and achieve comparable or superior results than state-of-the-art methods, further proving the effectiveness of our method on large-scale data. We hope that this work will provide useful data points and experience for future research in multi-modal pre-training. Our code and pre-trained models will be released to facilitate the research community.
 翻译:培训前工作(如CLIP和ALIGN)先行的双编码双编码器前工作(如CLIP和ALIGN)揭示了使多式代表与对比学习相结合的潜力,然而,这些工作需要大量数据和计算资源(如10亿级网络数据和数百个GPU),使资源有限的研究人员无法复制和进一步探索,为此,我们探索了一套简单而有效的制革方法,并提供了全面的培训指导,使我们能够根据有限的资源进行双编码多式代表调整。我们提供了可复制的竞争性结果的强有力基线,即ZeroVL,只有14M可公开查阅的学术数据集和8V100个VPUS。此外,我们收集了100M网络数据,用于培训前,并取得了比最新方法可比或优越的结果,进一步证明了我们在大规模数据上的方法的有效性。我们希望这项工作将为今后在多式培训前的研究中提供有用的数据点和经验。我们的代码和预培训模式将公布,以促进社区的研究。