Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle in real-world scenarios, where queries are usually textual and carry fine-grained semantics. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics, in which natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries that are generated and evaluated by large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. We further investigate a two-stage retrieval method that improves performance while remaining efficient in both time and space. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. The dataset and code will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
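The abstract does not specify how the two stages of the retrieval method are built. As a rough illustration only, the sketch below shows one common recall-then-rerank pattern: a cheap first stage scores every gallery document with a single global embedding, and a costlier second stage reranks a short list with a token-to-patch (late-interaction) score. The function names, the dictionary layout of the gallery, the MaxSim-style reranker, and the toy vectors are all assumptions made for this example, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def maxsim(query_tokens: Sequence[Vector], doc_patches: Sequence[Vector]) -> float:
    """Hypothetical fine-grained score: each query token takes its best-matching patch."""
    return sum(max(cosine(q, p) for p in doc_patches) for q in query_tokens)


def two_stage_retrieve(
    query_global: Vector,
    query_tokens: Sequence[Vector],
    gallery: List[Dict],
    shortlist_size: int,
    top_k: int,
) -> List[str]:
    """Return the ids of the top_k documents after coarse recall and reranking."""
    # Stage 1: score every document with one global vector and keep a short list.
    shortlist = sorted(
        gallery,
        key=lambda doc: cosine(query_global, doc["global"]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: rerank only the short list with the costlier token-to-patch score.
    reranked = sorted(
        shortlist,
        key=lambda doc: maxsim(query_tokens, doc["patches"]),
        reverse=True,
    )
    return [doc["id"] for doc in reranked[:top_k]]


def recall_at_k(ranked_ids: Sequence[str], target_id: str, k: int) -> float:
    """1.0 if the ground-truth document appears in the first k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


# Toy 2-D embeddings (made up for illustration; real models use far larger vectors).
gallery = [
    {"id": "A", "global": [1.0, 0.0], "patches": [[1.0, 0.0], [0.9, 0.1]]},
    {"id": "B", "global": [0.9, 0.1], "patches": [[0.0, 1.0], [1.0, 0.0]]},
    {"id": "C", "global": [0.0, 1.0], "patches": [[0.0, 1.0]]},
]
query_global = [1.0, 0.0]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

ranking = two_stage_retrieve(query_global, query_tokens, gallery, shortlist_size=2, top_k=2)
print(ranking, recall_at_k(ranking, target_id="B", k=1))
```

In this toy setup the global embedding alone would rank document A first, while the reranker promotes B because both query tokens find a matching patch there. The costlier score is computed only for the short list, which is the usual source of the time savings in such pipelines.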