Vision transformers have recently achieved competitive results across various vision tasks but still suffer from heavy computation costs when processing a large number of tokens. Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers, especially for image classification tasks. Typically, they select a small group of essential tokens according to their relevance to the class token and then fine-tune the weights of the vision transformer. Such fine-tuning is less practical for dense prediction, where the computation and GPU memory costs are much heavier than in image classification. In this paper, we focus on a more challenging problem: accelerating large-scale vision transformers for dense prediction without any additional re-training or fine-tuning. Because high-resolution representations are necessary for dense prediction, we present two non-parametric operators: a token clustering layer that decreases the number of tokens and a token reconstruction layer that increases it again. The method proceeds in three steps: (i) we use the token clustering layer to cluster neighboring tokens together, resulting in low-resolution representations that maintain the spatial structure; (ii) we apply the subsequent transformer layers only to these low-resolution representations, i.e., the clustered tokens; and (iii) we use the token reconstruction layer to re-create the high-resolution representations from the refined low-resolution representations. Our method obtains promising results on five dense prediction tasks: object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
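As a rough illustration of steps (i) to (iii), the following NumPy sketch shows one way two non-parametric operators of this kind could fit together. It is not the paper's implementation, and several choices are assumptions made only for illustration: clustering is approximated by averaging non-overlapping `s x s` windows of the token grid, reconstruction uses softmax weights derived from distances between the original tokens and the unrefined cluster centers, and the names `cluster_tokens`, `reconstruct_tokens`, and `tau` are hypothetical.

```python
import numpy as np


def cluster_tokens(tokens, h, w, s):
    """Step (i), assumed variant: average each non-overlapping s x s window.

    tokens: (h * w, d) array laid out row-major over an h x w grid.
    Returns (h // s * w // s, d) cluster centers that keep the grid layout.
    """
    n, d = tokens.shape
    assert n == h * w and h % s == 0 and w % s == 0
    grid = tokens.reshape(h // s, s, w // s, s, d)
    return grid.mean(axis=(1, 3)).reshape(-1, d)


def reconstruct_tokens(high_res, centers, refined_centers, tau=1.0):
    """Step (iii), assumed variant: softmax-weighted mix of refined centers.

    The weights come from the squared distance between each original
    high-resolution token and each unrefined center, so no parameters
    are learned.
    """
    d2 = ((high_res[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ refined_centers


# Usage on random data: an 8 x 8 grid of 16-dimensional tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))
centers = cluster_tokens(tokens, h=8, w=8, s=2)   # 64 tokens -> 16 tokens
refined = centers + 1.0  # placeholder for step (ii), the transformer layers
restored = reconstruct_tokens(tokens, centers, refined)
print(centers.shape, restored.shape)
```

In this sketch the layers of step (ii) see only a quarter of the tokens, which is where the savings would come from; the window size `s` and temperature `tau` trade speed against how faithfully the high-resolution tokens are recovered.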