Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from inaccurate semantic localization and poor generation quality, which in turn weaken the learned semantic representations. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that uses masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to improve generation quality and diversity. Through context-isolated flow-matching pretraining, our approach learns strong visual representations. Extensive experiments with large-scale pretrained models demonstrate that our method consistently outperforms previous generative pretraining methods for visual representation learning, as measured by attentive probing on downstream classification.
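To make the pretraining objective concrete, the sketch below illustrates one plausible reading of masked next-frame prediction with a context-isolated predictor and a conditioned flow-matching decoder. All module names, sizes, and the exact conditioning scheme are illustrative assumptions for exposition, not the paper's released implementation.

```python
# Minimal sketch (assumed design): encode context frames, predict a condition
# for the masked next frame, and train a flow-matching decoder on that frame.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Encodes visible (context) frames into latent tokens."""
    def __init__(self, frame_dim=768, latent_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(frame_dim, latent_dim), nn.GELU(),
                                  nn.Linear(latent_dim, latent_dim))

    def forward(self, frames):                # frames: (B, T, frame_dim)
        return self.proj(frames)              # (B, T, latent_dim)

class ContextIsolatedPredictor(nn.Module):
    """Predicts a conditioning vector for the masked next frame from context
    latents only, keeping representation learning and decoding decoupled."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, latent_dim))

    def forward(self, context):               # context: (B, T, latent_dim)
        q = self.query.expand(context.size(0), -1, -1)
        cond, _ = self.attn(q, context, context)
        return cond.squeeze(1)                 # (B, latent_dim)

class FlowMatchingDecoder(nn.Module):
    """Predicts the flow-matching velocity for the target frame,
    conditioned on the predictor output and the interpolation time t."""
    def __init__(self, frame_dim=768, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim + latent_dim + 1, 512), nn.GELU(),
                                 nn.Linear(512, frame_dim))

    def forward(self, x_t, cond, t):           # x_t: (B, frame_dim), t: (B, 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def training_step(encoder, predictor, decoder, frames):
    """frames: (B, T+1, frame_dim); the last frame is the masked target."""
    context, target = frames[:, :-1], frames[:, -1]
    cond = predictor(encoder(context))
    noise = torch.randn_like(target)
    t = torch.rand(target.size(0), 1)
    x_t = (1.0 - t) * noise + t * target       # linear interpolation path
    v_target = target - noise                  # constant-velocity flow target
    v_pred = decoder(x_t, cond, t)
    return ((v_pred - v_target) ** 2).mean()   # conditioned flow-matching loss

if __name__ == "__main__":
    enc, pred, dec = TinyEncoder(), ContextIsolatedPredictor(), FlowMatchingDecoder()
    loss = training_step(enc, pred, dec, torch.randn(2, 9, 768))
    loss.backward()
    print(float(loss))
```

In this reading, only the encoder and predictor are kept for downstream attentive probing, while the flow-matching decoder exists solely to supply the generative training signal.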