To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources have been proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, existing works adopt a multi-stage pre-training system, whose complex pipeline may increase the uncertainty and instability of pre-training. It is thus desirable that these strategies be integrated in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all existing pre-training approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation. Notably, we successfully pre-train a billion-parameter image backbone and achieve state-of-the-art performance on various benchmarks. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining.
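As a rough illustration of the kind of unified objective the abstract refers to, the sketch below writes a generic multi-modal mutual-information target. The notation here ($x_a$, $x_b$ for two views or modalities of the same sample, $f_{\theta}$, $g_{\phi}$ for their encoders) is ours and only approximates the paper's actual formulation.

```latex
% Schematic only (not the paper's exact formula): maximize the mutual
% information between the representation of one view/modality and that
% of another, drawn as a pair from the pre-training data distribution D.
\begin{equation}
  \max_{\theta,\phi}\; I\bigl(f_{\theta}(x_a);\, g_{\phi}(x_b)\bigr),
  \qquad (x_a, x_b) \sim \mathcal{D}.
\end{equation}
% Under this schematic view, supervised, weakly-supervised, and
% self-supervised pre-training would correspond to different choices of
% the paired signals (x_a, x_b), e.g. image/label, image/alt-text, or
% two augmented views of the same image.
```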