We present a unified dataset for document Question Answering (QA), obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our contribution is twofold: on the one hand, we reformulate existing Document AI tasks, such as Information Extraction (IE), as Question-Answering tasks, making the dataset a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all documents and include the exact position of each answer in the document image as a bounding box. Using this dataset, we study how different prompting techniques (which may include bounding-box information) affect the performance of open-weight models, identifying the most effective approaches for document comprehension.
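To make the described setup concrete, the following is a minimal sketch of what one QA record with OCR text and an answer bounding box might look like, together with a prompt builder that optionally injects the bounding-box hint. All field names (`ocr_text`, `question`, `answer_bbox`) and the prompt wording are illustrative assumptions, not the released schema.

```python
# Hypothetical record from the unified document-QA dataset.
# Field names and coordinate convention are assumptions for illustration.
record = {
    "ocr_text": "Invoice No. 4821  Date: 2021-03-15  Total: $1,250.00",
    "question": "What is the invoice number?",
    "answer": "4821",
    "answer_bbox": (112, 40, 168, 58),  # (x0, y0, x1, y1) pixel coords in the page image (assumed)
}

def build_prompt(record, include_bbox=False):
    """Compose a document-QA prompt from OCR text, optionally adding
    the answer's bounding box as a localization hint."""
    lines = [
        "Answer the question using only the document text below.",
        f"Document OCR: {record['ocr_text']}",
        f"Question: {record['question']}",
    ]
    if include_bbox:
        x0, y0, x1, y1 = record["answer_bbox"]
        lines.append(f"Hint: the answer lies in the region ({x0}, {y0}, {x1}, {y1}).")
    lines.append("Answer:")
    return "\n".join(lines)

print(build_prompt(record, include_bbox=True))
```

Comparing model accuracy with `include_bbox=True` versus `False` is one way the impact of bounding-box-aware prompting could be measured.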