GRAFT is a structured multimodal benchmark for evaluating models on instruction following, visual reasoning, and visual-textual alignment. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and visual clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question grounded solely in the visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both the reasoning and the output format. The benchmark introduces a taxonomy of reasoning types, including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection, to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines to support precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this area.
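As a minimal, hypothetical sketch of what such an instance and its evaluation could look like, the Python snippet below renders a small bar chart with matplotlib, pairs it with a multi-step question combining ranking and aggregation, stores the reference answer as structured JSON, and grades a prediction on two aspects (format compliance and factual correctness) independently. The field names, chart values, file names, and scoring rule are all illustrative assumptions, not the actual GRAFT schema or pipeline.

```python
import json
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

# --- 1. Programmatically generate a chart with known semantics --------------
# Hypothetical data; controlling the underlying values means the reference
# answer can be derived exactly from what is drawn.
categories = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 95, 150, 110]

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(categories, revenue, color="steelblue")
ax.set_title("Quarterly revenue (hypothetical)")
ax.set_ylabel("Revenue (USD, thousands)")
fig.savefig("instance_0001.png", dpi=150, bbox_inches="tight")
plt.close(fig)

# --- 2. Pair the image with a multi-step analytical question ----------------
# The question combines ranking, aggregation, and comparison, three of the
# reasoning types named in the taxonomy above.
instance = {
    "image": "instance_0001.png",
    "question": (
        "Which quarter has the highest revenue, and by how much does it "
        "exceed the average of the remaining quarters?"
    ),
    "reasoning_types": ["ranking", "aggregation", "comparison"],
    # Reference answer kept structured (JSON here) so that factual content
    # and output format can be scored separately.
    "reference_answer": {
        "highest_quarter": "Q3",
        "difference_from_mean_of_others": 150 - sum([120, 95, 110]) / 3,
    },
}

with open("instance_0001.json", "w") as f:
    json.dump(instance, f, indent=2)

# --- 3. Minimal aspect-based check against a model prediction ---------------
# A real evaluator would score formatting and every factual field; this only
# sketches the idea of grading aspects independently.
def score(prediction_text: str, reference: dict) -> dict:
    """Grade format compliance and one factual aspect independently."""
    try:
        pred = json.loads(prediction_text)
    except json.JSONDecodeError:
        return {"format": 0.0, "facts": 0.0}
    facts_ok = pred.get("highest_quarter") == reference["highest_quarter"]
    return {"format": 1.0, "facts": 1.0 if facts_ok else 0.0}

print(score('{"highest_quarter": "Q3"}', instance["reference_answer"]))
```

Swapping `json` for a YAML parser such as PyYAML would yield the YAML-formatted variant; in either case the point is that a structured answer schema, rather than free text, is what gets scored.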