We launch EVA, a vision-centric foundation model that explores the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out, image-text-aligned vision features conditioned on the visible image patches. Via this pretext task, we can efficiently scale EVA up to one billion parameters, and it sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap on the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP with EVA greatly stabilizes training and outperforms the from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
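
As a rough illustration of the pretext task described above, the sketch below regresses a frozen teacher's per-patch, image-text-aligned features at masked positions, conditioned on the input image; the toy encoder, dimensions, masking ratio, and negative-cosine loss are placeholder assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' released code) of masked feature reconstruction:
# regress a frozen teacher's image-text-aligned patch features at masked positions.
# ToyEncoder stands in for both the EVA ViT and the CLIP vision teacher; the
# dimensions, ~40% masking ratio, and negative-cosine loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Placeholder ViT-style encoder: maps each 16x16 patch to a token."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, N, dim) patch tokens
        return self.proj(images).flatten(2).transpose(1, 2)


def masked_feature_loss(student, teacher, head, images, mask):
    """Negative cosine similarity between predicted and teacher features,
    computed only at masked patch positions. For brevity the student sees the
    full image; the real model encodes visible patches plus mask tokens."""
    with torch.no_grad():                          # teacher is frozen
        target = teacher(images)                   # (B, N, teacher_dim)
    pred = head(student(images))                   # project to teacher space
    pred = F.normalize(pred[mask], dim=-1)
    target = F.normalize(target[mask], dim=-1)
    return -(pred * target).sum(dim=-1).mean()


student, teacher = ToyEncoder(dim=128), ToyEncoder(dim=64)
head = nn.Linear(128, 64)
images = torch.randn(2, 3, 224, 224)
mask = torch.rand(2, 14 * 14) < 0.4                # ~40% of 196 patches masked
loss = masked_feature_loss(student, teacher, head, images, mask)
loss.backward()
```

The key design choice this sketch tries to capture is that the regression target is a vision feature already aligned with text, so a single pre-training objective can absorb both low-level image structure and high-level semantics without any labels.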