Large-scale Transformer models bring significant improvements to various downstream vision-language tasks with a unified architecture. The performance gains come with increasing model size, resulting in slow inference and higher serving cost. While certain predictions benefit from the full complexity of the large-scale model, not all inputs require the same amount of computation, potentially leading to wasted computational resources. To handle this challenge, early exiting has been proposed to adaptively allocate computation according to input complexity and thereby improve inference efficiency. Existing early exiting strategies usually use the output confidence at intermediate layers as a proxy for input complexity when deciding whether to skip subsequent layers. However, such strategies cannot be applied to the encoder in the widely used unified architecture with both an encoder and a decoder, because output confidence is difficult to estimate in the encoder. Ignoring early exiting in the encoder component is suboptimal in terms of saved computation. To handle this challenge, we propose a novel early exiting strategy for unified vision-language models, which allows dynamically skipping layers in both the encoder and the decoder based on layer-wise input similarities, with multiple early exits, named \textbf{MuE}. By decomposing the image and text modalities in the encoder, MuE is flexible and can skip different layers for different modalities, advancing inference efficiency while minimizing the performance drop. Experiments on the SNLI-VE and MS COCO datasets show that the proposed approach MuE can reduce expected inference time by up to 50\% and 40\% while maintaining 99\% and 96\% of performance, respectively.
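To make the similarity-based exit criterion concrete, the following is a minimal sketch, not the authors' implementation: it assumes cosine similarity between consecutive layer outputs as the layer-wise similarity measure and a hypothetical threshold value; the same test could in principle be applied independently to the image branch, the text branch, and the decoder.

\begin{verbatim}
import torch
import torch.nn.functional as F

def forward_with_early_exit(layers, hidden, threshold=0.95):
    """Run a stack of Transformer layers, exiting once the
    representation stops changing between consecutive layers.

    `layers` is any iterable of modules mapping (B, T, D) -> (B, T, D);
    `threshold` is a hypothetical similarity cutoff, not a value
    reported in the paper.
    """
    prev = hidden
    for i, layer in enumerate(layers):
        hidden = layer(prev)
        # Cosine similarity between consecutive layer outputs, averaged
        # over the batch, used as a proxy for input complexity.
        sim = F.cosine_similarity(
            hidden.flatten(1), prev.flatten(1), dim=-1
        ).mean()
        if sim > threshold:
            # Representations have saturated: skip the remaining layers.
            return hidden, i + 1
        prev = hidden
    return hidden, len(layers)
\end{verbatim}

Because the exit decision is made per layer rather than once at the end, an input can trigger several exits across the image encoder, text encoder, and decoder, which is what the "multiple times of early exiting" in MuE refers to.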