The vision transformer splits each image into a fixed-length sequence of tokens and processes the tokens in the same way as words in natural language processing. More tokens normally lead to better performance but considerably increase the computational cost. Motivated by the proverb "A picture is worth a thousand words," we aim to accelerate the ViT model by making a long image short. To this end, we propose a novel approach that assigns token length adaptively during inference. Specifically, we first train a ViT model, called Resizable-ViT (ReViT), that can process any given input at diverse token lengths. We then retrieve "token-length labels" from ReViT and use them to train a lightweight Token-Length Assigner (TLA). A token-length label is the smallest number of tokens into which an image can be split such that ReViT still makes the correct prediction, and the TLA is trained to allocate the optimal token length based on these labels. The TLA thus enables ReViT to process each image with the minimum sufficient number of tokens during inference, boosting inference speed by reducing the token count in the ViT model. Our approach is general, compatible with modern vision transformer architectures, and can significantly reduce computational expense. We verify the effectiveness of our method on multiple representative ViT models (DeiT, LV-ViT, and TimeSformer) across two tasks (image classification and action recognition).
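The token-length label retrieval described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `mock_revit_predict`, the helper names, and the candidate token lengths are all hypothetical stand-ins for a trained ReViT that supports multiple token lengths.

```python
def token_length_label(predict, image, true_class, token_lengths):
    """Return the smallest token length (from `token_lengths`) at which
    `predict(image, n_tokens)` matches `true_class`; fall back to the
    largest length if the model is wrong at every length."""
    for n in sorted(token_lengths):
        if predict(image, n) == true_class:
            return n
    return max(token_lengths)

# Toy stand-in for ReViT: each "image" carries a difficulty, and the
# prediction is only correct once enough tokens are used.
def mock_revit_predict(image, n_tokens):
    difficulty, label = image
    return label if n_tokens >= difficulty else -1

lengths = [49, 196, 784]  # e.g. 7x7, 14x14, 28x28 patch grids
easy_image = (49, 3)      # classified correctly even with 49 tokens
hard_image = (784, 5)     # needs the full 784 tokens

print(token_length_label(mock_revit_predict, easy_image, 3, lengths))  # 49
print(token_length_label(mock_revit_predict, hard_image, 5, lengths))  # 784
```

The labels produced this way would then serve as supervision for the TLA, which learns to map an image directly to its minimal sufficient token length so the sweep over lengths is not needed at inference time.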