Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizers. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across autoregressive steps in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism: a blurred, global prediction of the next frame that serves as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance, which jointly strengthens spatial (canvas) and temporal conditioning, and we employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves strong performance among autoregressive models on the Kinetics-600 dataset and rivals diffusion-based methods.