From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation

Autoregressive models, remarkably successful in language modeling, have been widely adopted for visual generation. However, the sequential token-by-token decoding inherent in traditional autoregressive models makes inference slow. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors, a property that standard raster-scan decoding orders do not fully exploit. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are grouped into concentric rings according to their spatial distance from this center. Generation then proceeds ring by ring, from inner to outer regions, so that all tokens within the same ring are predicted in parallel. This design preserves the structural locality and spatial coherence of visual scenes while substantially increasing parallelization. Furthermore, because generating tokens simultaneously with limited context risks inconsistent predictions, we introduce a nested attention mechanism that dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
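
The ring grouping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `radial_rings`, the choice of the grid midpoint as the starting token, and the use of Chebyshev distance (which yields square rings) are all assumptions made for illustration, since the abstract specifies only "spatial distance" from the center.

```python
from collections import defaultdict


def radial_rings(height, width, center=None):
    """Group token positions of a height x width grid into concentric rings.

    Assumptions (not stated in the abstract): the ring index is the
    Chebyshev distance to the center, and the default center is the
    grid midpoint. Returns a list of rings ordered from inner to outer;
    each ring is a list of (row, col) positions.
    """
    if center is None:
        center = (height // 2, width // 2)
    center_row, center_col = center

    rings = defaultdict(list)
    for row in range(height):
        for col in range(width):
            # Chebyshev distance: the larger of the row and column offsets.
            distance = max(abs(row - center_row), abs(col - center_col))
            rings[distance].append((row, col))

    # Inner rings first, matching the inner-to-outer generation order.
    return [rings[d] for d in sorted(rings)]


# Hypothetical 16 x 16 token grid (256 tokens).
rings = radial_rings(16, 16)
num_parallel_steps = len(rings)                # one decoding step per ring
num_tokens = sum(len(ring) for ring in rings)  # every token appears exactly once
```

Under these assumptions, a 16 x 16 grid is covered in 9 ring-wise steps instead of 256 token-by-token steps. A different distance metric, such as Euclidean distance with rounding, would change the ring shapes and the step count.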
