In the present work, we show that the performance of formula-driven supervised learning (FDSL) can match or even exceed that of ImageNet-21k pre-training and can approach that of JFT-300M pre-training, without the use of real images, human supervision, or self-supervision during the pre-training of vision transformers (ViTs). For example, ViT-Base pre-trained on ImageNet-21k and JFT-300M achieved 83.0% and 84.1% top-1 accuracy, respectively, when fine-tuned on ImageNet-1k, while FDSL achieved 83.8% top-1 accuracy under comparable pre-training conditions (hyperparameters and number of epochs). Notably, ExFractalDB-21k pre-training used 14.2× fewer images than JFT-300M. Images generated by formulas avoid the privacy and copyright issues, labeling costs and errors, and biases from which real images suffer, and thus have tremendous potential for pre-training general-purpose models. To understand why synthetic images perform so well, we tested two hypotheses: (i) object contours are what matter in FDSL datasets, and (ii) an increased number of parameters for label creation improves performance in FDSL pre-training. To test the former hypothesis, we constructed a dataset consisting of simple combinations of object contours and found that it matched the performance of fractal databases. For the latter hypothesis, we found that increasing the difficulty of the pre-training task generally leads to better fine-tuning accuracy.
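The fractal databases referenced above (FractalDB, ExFractalDB) derive both images and class labels from iterated function systems (IFS), so the label comes for free from the generating formula. As a rough illustration of this idea only, the sketch below renders one fractal image from a set of affine maps; the specific maps, point count, and canvas size are illustrative assumptions, not the paper's released generation pipeline.

```python
# Minimal sketch of formula-driven image synthesis via an iterated function
# system (IFS). The parameters below are illustrative assumptions, not the
# configuration used to build FractalDB/ExFractalDB.
import numpy as np

def render_ifs(maps, probs, n_points=100_000, size=256, seed=0):
    """Rasterize the orbit of a 2-D IFS: repeatedly pick a random affine map
    (x, y) -> (a*x + b*y + e, c*x + d*y + f) and plot the visited points."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    pts = np.empty((n_points, 2))
    for i in range(-20, n_points):        # first 20 iterations are burn-in
        a, b, c, d, e, f = maps[rng.choice(len(maps), p=probs)]
        x, y = a * x + b * y + e, c * x + d * y + f
        if i >= 0:
            pts[i] = (x, y)
    # Normalize the orbit into pixel coordinates and mark visited pixels.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    ij = ((pts - lo) / (hi - lo + 1e-9) * (size - 1)).astype(int)
    canvas = np.zeros((size, size), dtype=np.uint8)
    canvas[ij[:, 1], ij[:, 0]] = 255
    return canvas

# One sampled set of affine maps defines one category; every image in the
# category is rendered from (a perturbation of) those maps, so the class
# label is obtained without any human annotation.
sierpinski = [(0.5, 0.0, 0.0, 0.5, 0.00, 0.0),
              (0.5, 0.0, 0.0, 0.5, 0.50, 0.0),
              (0.5, 0.0, 0.0, 0.5, 0.25, 0.5)]
img = render_ifs(sierpinski, probs=[1 / 3, 1 / 3, 1 / 3])
```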