Deep generative models have shown impressive results in text-to-image synthesis. However, current text-to-image models often generate images that are inadequately aligned with text prompts. We propose a fine-tuning method for aligning such models using human feedback, comprising three stages. First, we collect human feedback assessing model output alignment on a set of diverse text prompts. We then use the human-labeled image-text dataset to train a reward function that predicts human feedback. Lastly, the text-to-image model is fine-tuned by maximizing reward-weighted likelihood to improve image-text alignment. Our method generates objects with specified colors, counts, and backgrounds more accurately than the pre-trained model. We also analyze several design choices and find that careful investigation of such design choices is important for balancing the alignment-fidelity tradeoff. Our results demonstrate the potential of learning from human feedback to significantly improve text-to-image models.
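As a rough illustration of the third stage, the reward-weighted likelihood objective can be sketched as follows; this is a simplified form that weights a standard log-likelihood term by the learned reward and omits any regularization terms the full method may use:

$$\mathcal{L}(\theta) = \mathbb{E}_{(z, x) \sim \mathcal{D}}\big[-\, r_\phi(z, x)\, \log p_\theta(x \mid z)\big],$$

where $z$ is a text prompt, $x$ the corresponding image, $p_\theta$ the text-to-image model being fine-tuned, and $r_\phi$ the reward function trained on the human-labeled image-text dataset.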