Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. Because the generator receives no direct supervision from pixel space, this misalignment can yield token sequences that decode into low-quality images. We propose VA-$\pi$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\pi$ formulates generator-tokenizer alignment as a variational optimization problem, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize over the discrete token space, VA-$\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward measures how well the predicted token sequences reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO naturally maintains the distributional consistency of the generated tokens. VA-$\pi$ enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of the ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains on the GenEval text-to-image benchmark for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multimodal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.
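For concreteness, an ELBO of the kind described here typically takes the standard variational form below; the notation ($q$ for the tokenizer encoder, $p_\theta$ for its decoder, $p_\phi$ for the AR generator over tokens $z = (z_1, \dots, z_N)$) is our own labeling for illustration, and the paper's exact derivation may differ:

$$
\log p(x) \;\ge\; \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{pixel reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\!\left(q(z \mid x)\,\big\|\,p_\phi(z)\right)}_{\text{token regularization}},
\qquad
p_\phi(z) = \prod_{t=1}^{N} p_\phi(z_t \mid z_{<t}).
$$

Under this reading, the first term supplies the pixel-space reward, while the KL term contains the AR token likelihood $\mathbb{E}_{q}[\log p_\phi(z)]$, the same quantity conventional AR training maximizes, which is why a single ELBO can unify both objectives.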
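The reinforcement-based alignment can likewise be read as a REINFORCE-style update with a teacher-forced reconstruction reward. The sketch below illustrates that reading only; `va_pi_step`, `ar_model`, `tokenizer_enc`, `tokenizer_dec`, the negative-MSE reward, and `kl_weight` are all illustrative assumptions, not the authors' implementation (see the repository for the actual code).

```python
import torch
import torch.nn.functional as F

def va_pi_step(ar_model, tokenizer_enc, tokenizer_dec, images, kl_weight=0.1):
    """One hypothetical VA-pi alignment step (illustrative sketch)."""
    with torch.no_grad():
        gt_tokens = tokenizer_enc(images)          # (B, N) ground-truth token ids

    # Teacher forcing: condition on ground-truth prefixes, predict each next
    # token in parallel (index shifting for next-token prediction omitted).
    logits = ar_model(gt_tokens)                   # (B, N, V) next-token logits
    dist = torch.distributions.Categorical(logits=logits)
    pred_tokens = dist.sample()                    # sampled predictions, no free-running rollout

    with torch.no_grad():
        recon = tokenizer_dec(pred_tokens)         # decode predicted tokens back to pixels
        # Pixel-space intrinsic reward: negative per-image reconstruction error.
        reward = -F.mse_loss(recon, images, reduction="none").mean(dim=(1, 2, 3))
        reward = (reward - reward.mean()) / (reward.std() + 1e-8)  # simple baseline

    # REINFORCE: increase likelihood of token sequences that reconstruct well.
    log_prob = dist.log_prob(pred_tokens).sum(dim=1)   # (B,)
    pg_loss = -(reward * log_prob).mean()

    # ELBO-style regularizer: keep the policy close to the ground-truth token
    # distribution (here approximated by plain cross-entropy to gt tokens).
    reg = F.cross_entropy(logits.flatten(0, 1), gt_tokens.flatten())
    return pg_loss + kl_weight * reg
```

Because the reward is computed from one-step-ahead predictions under teacher forcing, each update needs a single parallel forward pass plus one decode, rather than an autoregressive rollout, which is consistent with the claimed 25-minute tuning budget.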