VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

Recent progress in diffusion models has significantly advanced various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which are inefficient when a wide range of different needs must be supported. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework that supports a wide range of in-domain tasks, generalization to unseen tasks, unseen unification of multiple tasks, and reverse generation. Existing methods rely on language-based task instruction, which leads to task ambiguity and weak generalization; in contrast, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we find that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying their architectures.
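To make the link between visual in-context learning and image infilling concrete, here is a minimal sketch of how demonstrations and a query could be laid out as one grid image whose last cell is masked for infilling. This is an illustration under stated assumptions, not the authors' implementation: the function name `build_cloze_grid`, the row-per-example layout, and the choice of the bottom-right cell as the masked target are all hypothetical.

```python
import numpy as np


def build_cloze_grid(demos, query_conditions, cell_hw, channels=3):
    """Arrange in-context demos and a query into one grid plus an infilling mask.

    Hypothetical layout (an assumption, not taken from the paper):
    each row is one example, columns are [condition_1, ..., condition_k, target].
    The query row has its target cell blanked; the mask marks that cell.
    """
    h, w = cell_hw
    n_cols = len(query_conditions) + 1

    # The query row reuses the demo layout, with an empty placeholder as target.
    blank = np.zeros((h, w, channels), dtype=np.float32)
    rows = [list(r) for r in demos] + [list(query_conditions) + [blank]]

    for row in rows:
        if len(row) != n_cols:
            raise ValueError("every row must have the same number of cells")
        for cell in row:
            if cell.shape != (h, w, channels):
                raise ValueError("every cell must have shape (h, w, channels)")

    # Concatenate cells left to right, then rows top to bottom.
    grid = np.concatenate([np.concatenate(r, axis=1) for r in rows], axis=0)

    # Mask is 1 where an infilling model would generate, 0 where it only reads.
    mask = np.zeros(grid.shape[:2], dtype=np.float32)
    mask[-h:, -w:] = 1.0
    return grid, mask


# Usage: one demonstration (condition -> target) and one query condition.
rng = np.random.default_rng(0)
cell = lambda: rng.random((4, 4, 3), dtype=np.float32)
grid, mask = build_cloze_grid(demos=[[cell(), cell()]],
                              query_conditions=[cell()],
                              cell_hw=(4, 4))
```

Under this layout, the task is never named in text: the demonstration rows show what the mapping is, and filling the masked cell is the generation step, which is why an infilling model could in principle be used unchanged.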
