Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
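To make the "single stream of data" idea concrete, below is a minimal sketch, not the paper's actual implementation: text tokens and discrete image tokens are concatenated into one sequence and a decoder-only transformer is trained with ordinary next-token prediction. The vocabulary sizes, sequence lengths, model dimensions, and the use of standard PyTorch transformer layers are illustrative assumptions; the paper's model uses its own tokenizers and a much larger sparse transformer.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # assumed sizes; image tokens come from a discrete VAE
TEXT_LEN, IMAGE_LEN = 256, 1024         # text tokens followed by a grid of image tokens
D_MODEL, N_HEAD, N_LAYER = 512, 8, 6    # small, illustrative transformer

class TextImageTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # Text and image tokens share one embedding table by offsetting image token ids.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        # tokens: (batch, seq_len) ids from the combined text+image vocabulary
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        # Causal mask: each position attends only to earlier text/image tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))

# Training objective: predict every next token in the combined stream.
model = TextImageTransformer()
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (2, IMAGE_LEN))
stream = torch.cat([text, image], dim=1)            # text and image tokens as a single stream
logits = model(stream[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), stream[:, 1:].reshape(-1)
)
```

At sampling time, under the same assumptions, one would condition on the text tokens and autoregressively draw image tokens from the model, then decode them back to pixels with the image tokenizer.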