ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

Vision Transformers (ViTs) deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to every input regardless of its complexity, which wastes computation on easy inputs. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process repeats until the model reaches the predefined confidence level or exhausts its maximum capacity. To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage. Experiments on ImageNet-1K show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs. We show that the backbone-preserving design of ThinkingViT allows it to serve as a plug-in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin. The source code is available at https://github.com/ds-kiel/ThinkingViT.
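
The confidence-gated loop described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the names `progressive_inference`, `make_toy_stage`, and `fuse` are hypothetical, confidence is assumed to be the maximum softmax probability, and Token Recycling is approximated by a simple weighted sum, since the abstract does not specify the confidence measure or the fusion rule. Each "stage" stands in for a forward pass with a larger subset of attention heads.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
# Hypothetical stage signature: (input embedding, recycled embedding or None)
# -> (class logits, embedding to recycle into the next stage).
Stage = Callable[[Vector, Optional[Vector]], Tuple[Vector, Vector]]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(x: Vector, previous: Vector, alpha: float = 0.5) -> Vector:
    """Assumed fusion rule (weighted sum); the real Token Recycling may differ."""
    return [xi + alpha * pi for xi, pi in zip(x, previous)]


def make_toy_stage(sharpness: float) -> Stage:
    """Toy stand-in for a forward pass; larger sharpness mimics more capacity."""
    def stage(x: Vector, previous: Optional[Vector]) -> Tuple[Vector, Vector]:
        hidden = x if previous is None else fuse(x, previous)
        logits = [sharpness * h for h in hidden]
        return logits, hidden
    return stage


def progressive_inference(
    x: Vector, stages: Sequence[Stage], threshold: float
) -> Tuple[int, float, int]:
    """Run stages from smallest to largest; stop once confidence exceeds threshold.

    Returns (predicted label, confidence, number of stages used).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    previous: Optional[Vector] = None
    label, confidence, used = -1, 0.0, 0
    for stage in stages:
        used += 1
        logits, previous = stage(x, previous)  # recycle embedding into next stage
        probs = softmax(logits)
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence > threshold:
            break  # early exit: the prediction is confident enough
    return label, confidence, used


stages = [make_toy_stage(1.0), make_toy_stage(4.0)]
easy = progressive_inference([5.0, 0.0, 0.0], stages, threshold=0.9)
hard = progressive_inference([0.6, 0.5, 0.0], stages, threshold=0.9)
print("easy input used", easy[2], "stage(s)")
print("hard input used", hard[2], "stage(s)")
```

In this sketch the clearly separable input exits after the first stage, while the ambiguous one falls through to the second stage and returns its prediction even though the threshold is never reached, mirroring the "exhausts its maximum capacity" case. Raising the threshold trades throughput for accuracy, because more inputs are routed to the larger stage.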