Incorporating human feedback has been shown to be crucial for aligning text generated by large language models with human preferences. We hypothesize that state-of-the-art instructional image editing models, whose outputs are generated from an input image and an editing instruction, could similarly benefit from human feedback, as their outputs may not faithfully follow the instructions and preferences of users. In this paper, we present a novel framework to harness human feedback for instructional visual editing (HIVE). Specifically, we collect human feedback on edited images and learn a reward function that captures the underlying user preferences. We then introduce scalable diffusion model fine-tuning methods that incorporate human preferences based on the estimated reward. In addition, to mitigate bias introduced by data limitations, we contribute a new 1M training dataset, a 3.6K reward dataset for reward learning, and a 1K evaluation dataset to improve the performance of instructional image editing. We conduct extensive quantitative and qualitative experiments, showing that HIVE is favored over previous state-of-the-art instructional image editing approaches by a large margin.
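To make the fine-tuning idea concrete, the following is a minimal sketch of a generic reward-weighted diffusion objective; it is an illustrative assumption rather than the exact objective used in HIVE, and the symbols $r_\phi$ (learned reward model), $c_I$ (input image), $c_T$ (editing instruction), $z_t$ (noised latent of the edited image $x$ at timestep $t$), and $\beta$ (temperature) are introduced here only for exposition:
\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(x,\, c_I,\, c_T) \sim \mathcal{D},\; \epsilon \sim \mathcal{N}(0, I),\; t}\!\left[ \exp\!\big(r_\phi(x, c_I, c_T)/\beta\big)\; \big\| \epsilon - \epsilon_\theta(z_t, t, c_I, c_T) \big\|_2^2 \right].
\]
Under this sketch, edits judged more favorably by the reward model contribute more strongly to the denoising loss, which is one simple way to fold an estimated reward into standard diffusion fine-tuning.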