Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotation, all of which constrain their scalability and practical applicability. To address this, we propose \textbf{I2I-Bench}, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories spanning both single-image and multi-image editing, (ii) comprehensive evaluation dimensions, including 30 decoupled, fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, demonstrating the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs among them across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.