Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling of this kind underlies the scaling success of Large Language Models. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
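The coupling of an invertible flow with an autoregressive model over its latents gives an exact likelihood through the change-of-variables formula, log p(x) = log p_AR(z) + log|det dz/dx|. The following is a minimal toy sketch of that identity only, not FARMER's architecture: the elementwise affine flow, the unit-variance Gaussian AR prior, and all names (`flow_forward`, `flow_inverse`, `ar_log_prob`, `log_likelihood`, `shift`, `log_scale`, `coef`) are assumptions introduced for illustration.

```python
import numpy as np

def flow_forward(x, shift, log_scale):
    """Invertible elementwise affine map x -> z, plus log|det dz/dx|."""
    z = (x - shift) * np.exp(-log_scale)
    log_det = -np.sum(log_scale)  # Jacobian is diagonal with entries exp(-log_scale)
    return z, log_det

def flow_inverse(z, shift, log_scale):
    """Exact inverse of flow_forward, mapping latents back to data."""
    return z * np.exp(log_scale) + shift

def ar_log_prob(z, coef):
    """Toy AR prior: z_t ~ N(coef * z_{t-1}, 1), with 0 as the predecessor of z_1."""
    prev = np.concatenate(([0.0], z[:-1]))
    resid = z - coef * prev
    return float(np.sum(-0.5 * resid**2 - 0.5 * np.log(2 * np.pi)))

def log_likelihood(x, shift, log_scale, coef):
    """Exact log p(x) = log p_AR(z) + log|det dz/dx| for the toy model."""
    z, log_det = flow_forward(x, shift, log_scale)
    return ar_log_prob(z, coef) + float(log_det)

# Hypothetical 3-element "image" and parameters, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0])
shift = np.array([0.5, 0.5, 0.5])
log_scale = np.array([0.1, -0.2, 0.3])
z, _ = flow_forward(x, shift, log_scale)
x_back = flow_inverse(z, shift, log_scale)  # recovers x up to float error
ll = log_likelihood(x, shift, log_scale, coef=0.9)
```

In this sketch, sampling would run the two parts in reverse: draw z sequentially from the AR prior, then apply `flow_inverse`. A real system would replace both toy components with learned networks.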