Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
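To make the core idea concrete, below is a minimal, hedged sketch (not the authors' implementation) of what "randomizing the patch size at training time" with a single set of weights could look like: one shared patch-embedding kernel is stored at a base size and resized to whatever patch size is drawn at each step. The names (`base_kernel`, `embed_patches`), the bilinear resizing, and the specific patch sizes and dimensions are illustrative assumptions; the paper's actual resizing scheme and training recipe may differ.

```python
# Sketch only: per-step random patch size with one shared embedding kernel.
import jax
import jax.numpy as jnp

PATCH_SIZES = (8, 12, 16, 20, 24, 30, 40, 48)  # candidate sizes; image side = 240 (assumed)
HIDDEN = 192                                   # embedding width (assumed)

# Single shared parameter: a patch-embedding kernel stored at a base patch size.
base_kernel = jax.random.normal(jax.random.PRNGKey(0), (32, 32, 3, HIDDEN)) * 0.02

def resized_kernel(p):
    """Resize the shared kernel to patch size p (bilinear resize is an assumption)."""
    return jax.image.resize(base_kernel, (p, p, 3, HIDDEN), method="bilinear")

def embed_patches(images, p):
    """Patchify images with patch size p and project them to HIDDEN dims."""
    k = resized_kernel(p)
    # Non-overlapping patches == convolution with stride p.
    return jax.lax.conv_general_dilated(
        images, k, window_strides=(p, p), padding="VALID",
        dimension_numbers=("NHWC", "HWIO", "NHWC"))

# Training-time flexibility: each step draws a random patch size, so the same
# weights see many sequence lengths, i.e. many compute budgets.
images = jnp.zeros((2, 240, 240, 3))
for step, key in enumerate(jax.random.split(jax.random.PRNGKey(1), 3)):
    p = int(jax.random.choice(key, jnp.array(PATCH_SIZES)))
    tokens = embed_patches(images, p).reshape(2, -1, HIDDEN)
    print(step, p, tokens.shape)  # sequence length varies with patch size
```

At deployment, the same weights can then be run with a small patch size for accuracy or a large one for speed, which is the compute-adaptive behavior the abstract describes.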