Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
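To make the core idea concrete, below is a minimal sketch of patch-size randomization during training. It is not the released big_vision implementation: all helper names, shapes, and the candidate patch sizes are illustrative assumptions, and plain bilinear resizing stands in for the pseudo-inverse ("PI") resize of the patch-embedding kernel that the paper proposes. A single set of underlying weights is resized on the fly to whatever patch size is sampled at each step.

```python
# Minimal sketch (assumed names/shapes, not the released big_vision code) of
# FlexiViT-style training: sample a patch size each step and resize the
# patch-embedding kernel and position embeddings so one set of underlying
# weights serves all patch sizes. Bilinear resizing is used here for
# simplicity; the paper resizes the kernel with a pseudo-inverse (PI) resize.
import random

import jax
import jax.numpy as jnp

IMAGE_SIZE = 240                                    # divisible by every candidate patch size below
PATCH_SIZES = (8, 10, 12, 15, 16, 20, 24, 30, 40, 48)
HIDDEN_DIM = 768
BASE_PATCH, BASE_GRID = 32, 7                       # shape of the stored ("underlying") parameters


def resize_kernel(kernel, p):
    """Resize a (bp, bp, 3, d) patch-embedding kernel to (p, p, 3, d)."""
    return jax.image.resize(kernel, (p, p) + kernel.shape[2:], method="bilinear")


def resize_posemb(posemb, grid):
    """Resize a (1, g*g, d) position-embedding table to (1, grid*grid, d)."""
    g = int(posemb.shape[1] ** 0.5)
    pe = posemb.reshape(1, g, g, -1)
    pe = jax.image.resize(pe, (1, grid, grid, pe.shape[-1]), method="bilinear")
    return pe.reshape(1, grid * grid, -1)


def patchify(images, p):
    """Cut (b, h, w, c) images into (b, num_patches, p*p*c) flattened patches."""
    b, h, w, c = images.shape
    gh, gw = h // p, w // p
    x = images.reshape(b, gh, p, gw, p, c).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(b, gh * gw, p * p * c)


def flexible_embed(images, kernel, posemb, p):
    """Tokenize images with a patch size chosen at runtime from shared weights."""
    k = resize_kernel(kernel, p)                                # per-step kernel of the right size
    tokens = patchify(images, p) @ k.reshape(-1, k.shape[-1])   # linear patch projection
    return tokens + resize_posemb(posemb, IMAGE_SIZE // p)      # add matching position embeddings


# One "training step": draw a patch size at random, tokenize, then feed the
# tokens to an ordinary (unmodified) transformer encoder.
rng = jax.random.PRNGKey(0)
kernel = 0.02 * jax.random.normal(rng, (BASE_PATCH, BASE_PATCH, 3, HIDDEN_DIM))
posemb = 0.02 * jax.random.normal(rng, (1, BASE_GRID * BASE_GRID, HIDDEN_DIM))
images = jnp.zeros((2, IMAGE_SIZE, IMAGE_SIZE, 3))

p = random.choice(PATCH_SIZES)
tokens = flexible_embed(images, kernel, posemb, p)              # (2, (240 // p) ** 2, 768)
```

At deployment time the same resizing step is applied once with a fixed patch size chosen for the available compute budget; smaller patches yield more tokens (higher cost, higher accuracy), larger patches fewer.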