With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that, because of this, transformers are not suitable for small sets of data. This trend leads to concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that with the right size and convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size, and can have as few as 0.28M parameters while achieving competitive results. Our best model can reach 98% accuracy when trained from scratch on CIFAR-10 with only 3.7M parameters, a significant improvement in data efficiency over previous Transformer-based models: it is over 10x smaller than other transformers and 15% of the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN-based approaches, and even some recent NAS-based approaches. Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as on NLP tasks. Our simple and compact design for transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data-efficient transformers. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.
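As a rough illustration of the convolutional tokenization mentioned above, the following PyTorch sketch replaces ViT-style patch slicing with a small convolution-and-pooling stack whose output feature map is flattened into a token sequence. The layer sizes, pooling settings, and embedding dimension here are illustrative assumptions for a CIFAR-sized input, not the exact configuration used by CCT or the public repository.

```python
import torch
import torch.nn as nn


class ConvTokenizer(nn.Module):
    """Minimal sketch of a convolutional tokenizer: a conv + ReLU + max-pool
    stack produces a feature map that is flattened into a sequence of tokens
    for the transformer encoder. Hyperparameters are illustrative only."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        x = self.conv(x)                     # (batch, embed_dim, H', W')
        return x.flatten(2).transpose(1, 2)  # (batch, H' * W', embed_dim)


# Example: a 32x32 CIFAR-10 image becomes a 16x16 = 256-token sequence.
tokens = ConvTokenizer()(torch.randn(1, 3, 32, 32))
print(tokens.shape)  # torch.Size([1, 256, 256])
```

Because the tokenizer is fully convolutional, it imposes no fixed patch grid and introduces a local inductive bias before the attention layers, which is consistent with the data-efficiency argument made in the abstract.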