The state of the art in vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs caused by slow convergence and long training time, especially on large-scale web datasets. An essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens) in masked language modeling (MLM); that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from the prediction loss. To accelerate the convergence of VLP, we propose a new pretraining task, namely free language modeling (FLM), which enables a 100% prediction rate with arbitrary corruption rates. FLM decouples the prediction rate from the corruption rate while allowing the corruption span to be customized for each token to be predicted. FLM-trained models are encouraged to learn better and faster within the same GPU time by exploiting bidirectional contexts more flexibly. Extensive experiments show that FLM achieves an impressive 2.5x reduction in pretraining time compared with MLM-based methods, while maintaining competitive performance on both vision-language understanding and generation tasks. Code will be made public at https://github.com/TencentARC/FLM.
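The following is a minimal sketch, not the paper's actual implementation, of the rate entanglement the abstract describes: in MLM, the loss covers only the corrupted positions, so the prediction rate is tied to the corruption rate, whereas an FLM-style objective computes the loss over all positions regardless of how much context is corrupted. All tensors, rates, and the per-token corruption mask below are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder model outputs and targets (a real encoder would produce the logits).
vocab_size, seq_len, batch = 30522, 32, 4
logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch, seq_len))

# --- MLM: prediction rate is tied to the corruption rate ---
corruption_rate = 0.15
mlm_mask = torch.rand(batch, seq_len) < corruption_rate      # corrupted positions
mlm_loss = F.cross_entropy(logits[mlm_mask], targets[mlm_mask])
# Loss covers only ~15% of output tokens; the rest are excluded.

# --- FLM-style objective (sketch): 100% prediction rate ---
# Conceptually, each position is reconstructed from its own corrupted context
# (sketched here as a per-token corruption mask that a real encoder would apply
# in attention); the loss then covers every position, whatever the corruption rate.
per_token_corruption = torch.rand(batch, seq_len, seq_len) < corruption_rate
flm_loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```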