Token-based masked generative models are gaining popularity for their fast inference enabled by parallel decoding. While recent token-based approaches achieve performance competitive with diffusion-based models, their generation quality remains suboptimal because they sample multiple tokens simultaneously without accounting for the dependencies among them. We empirically investigate this problem and propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), which selects optimal tokens via localized supervision with text information. TCTS improves not only the quality of the generated images but also their semantic alignment with the given texts. To further improve image quality, we introduce a cohesive sampling strategy, Frequency Adaptive Sampling (FAS), applied to each group of tokens divided according to the self-attention maps. We validate the efficacy of TCTS combined with FAS on various generative tasks, demonstrating that it significantly outperforms the baselines in both image-text alignment and image quality. Our text-conditioned sampling framework further reduces the original inference time by more than 50% without modifying the original generative model.
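To make the setting concrete, the following is a minimal sketch, not the authors' implementation, of iterative parallel decoding in which a learnable, text-conditioned scorer decides which of the simultaneously sampled tokens to commit at each step, rather than relying on the model's raw confidence. The names `generator`, `selector`, and `text_emb` are hypothetical placeholders standing in for the base masked generative model, a TCTS-like token selector, and the text conditioning, respectively.

```python
# Conceptual sketch of masked parallel decoding with a text-conditioned
# token selector. Assumes a PyTorch-style generator and selector; all
# component names are illustrative, not from the paper's codebase.
import torch

def iterative_decode(generator, selector, text_emb, seq_len, vocab_size,
                     mask_id, num_steps=8, device="cpu"):
    """Start from an all-masked canvas and commit tokens over `num_steps`.

    generator(tokens, text_emb) -> logits of shape (B, L, V)
    selector(logits, text_emb)  -> per-token keep scores of shape (B, L);
        this plays the role of a learned, text-conditioned scorer that
        decides which simultaneously sampled tokens are reliable to keep.
    """
    B = text_emb.shape[0]
    tokens = torch.full((B, seq_len), mask_id, dtype=torch.long, device=device)

    for step in range(num_steps):
        logits = generator(tokens, text_emb)                    # (B, L, V)
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.multinomial(probs.view(-1, vocab_size), 1)
        sampled = sampled.view(B, seq_len)                      # parallel sample

        # Rank positions with the text-conditioned selector instead of
        # the generator's own confidence.
        scores = selector(logits, text_emb)                     # (B, L)
        still_masked = tokens.eq(mask_id)
        scores = scores.masked_fill(~still_masked, float("-inf"))

        # Simple linear unmasking schedule: commit ~L/num_steps tokens per step.
        k = int((step + 1) * seq_len / num_steps) - int(step * seq_len / num_steps)
        k = max(1, k)
        keep_idx = torch.topk(scores, k, dim=-1).indices        # (B, k)
        tokens.scatter_(1, keep_idx, sampled.gather(1, keep_idx))

    return tokens
```

The key design point this sketch tries to convey is that the commit decision is decoupled from the sampling step: any scoring rule, including one supervised with text information as in TCTS, can be plugged in as `selector` without modifying the underlying generative model.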