VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors

Understanding multi-image, multi-turn scenarios is a critical yet underexplored capability for Large Vision-Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons (e.g., spotting visual differences or assessing appropriateness) and rely heavily on language cues. Such settings overlook progressive, context-dependent reasoning and the challenge of visual-to-visual inference. To bridge this gap, we present VisChainBench, a large-scale benchmark designed to rigorously evaluate LVLMs' ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real-world decision-making processes. Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias. All benchmark data and the code for benchmark construction are available for viewing and download at the following link: https://huggingface.co/datasets/eyehole/VisChainBench
