Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this gap, we introduce SDS KoPub VDR, the first large-scale public benchmark for retrieving and understanding Korean public documents. The benchmark is built on 361 real-world documents (256 files released under the KOGL Type 1 license and 105 from official legal portals) and captures complex visual elements such as tables, charts, and multi-column layouts. To establish a reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated with multimodal models (e.g., GPT-4o) and then verified by humans for factual accuracy and contextual relevance. The queries span six major public domains and are categorized by the reasoning modality they require: text-based, visual-based, or cross-modal. We evaluate models on SDS KoPub VDR in two complementary tasks: (1) text-only retrieval and (2) multimodal retrieval, which leverages visual features alongside text. This dual-task evaluation reveals substantial performance gaps even for state-of-the-art models, particularly in multimodal scenarios that require cross-modal reasoning. As a foundational resource, SDS KoPub VDR enables rigorous, fine-grained evaluation and provides a roadmap for advancing multimodal AI in real-world document intelligence. The dataset is available at https://huggingface.co/datasets/SamsungSDS-Research/SDS-KoPub-VDR-Benchmark.
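To make the evaluation setup concrete, the following is a minimal sketch of how page-level retrieval over query-page-answer triples might be scored and broken down by query category. The abstract does not name a metric, so recall@k is an assumption here, as are the function names, the page identifiers, and the toy rankings. None of them come from the benchmark itself.

```python
from typing import Dict, List


def recall_at_k(ranked_pages: Dict[str, List[str]],
                gold_page: Dict[str, str],
                k: int) -> float:
    """Fraction of queries whose gold page appears among the top-k ranked pages."""
    if not gold_page:
        raise ValueError("no queries to evaluate")
    hits = sum(
        1
        for qid, gold in gold_page.items()
        if gold in ranked_pages.get(qid, [])[:k]
    )
    return hits / len(gold_page)


def recall_by_type(ranked_pages: Dict[str, List[str]],
                   gold_page: Dict[str, str],
                   query_type: Dict[str, str],
                   k: int) -> Dict[str, float]:
    """Recall@k computed separately for each query category."""
    groups: Dict[str, List[str]] = {}
    for qid, qtype in query_type.items():
        groups.setdefault(qtype, []).append(qid)
    return {
        qtype: recall_at_k(ranked_pages, {q: gold_page[q] for q in qids}, k)
        for qtype, qids in groups.items()
    }


# Hypothetical toy data: three queries, one per reasoning category.
gold = {"q1": "docA_p3", "q2": "docB_p1", "q3": "docC_p7"}
types = {"q1": "text", "q2": "visual", "q3": "cross-modal"}
ranked = {
    "q1": ["docA_p3", "docA_p4"],  # gold page ranked first
    "q2": ["docB_p2", "docB_p1"],  # gold page ranked second
    "q3": ["docC_p1", "docC_p2"],  # gold page not retrieved
}

print(recall_at_k(ranked, gold, k=1))
print(recall_by_type(ranked, gold, types, k=2))
```

Running the same scoring once on rankings from a text-only retriever and once on rankings from a multimodal retriever would give the kind of per-category comparison the dual-task evaluation describes.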