Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great practical value but remains less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be fine-tuned on various downstream tasks with performance comparable to its larger counterpart. MiniVLM consists of two modules: a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, which reduces the time cost of visual feature extraction by $95\%$ compared to a baseline model. After comparing different compact BERT models, we adopt the MiniLM structure to reduce the computation cost of the transformer module. In addition, we improve MiniVLM pre-training by adding $7$M Open Images examples, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. These large models are used only offline and add no overhead to fine-tuning or inference. With the above design choices, MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$ while retaining $94$-$97\%$ of the accuracy on multiple VL tasks. We hope that MiniVLM eases the use of state-of-the-art VL research in on-the-edge applications.