BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

We present a unified dataset for document Question-Answering (QA), obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold. First, we reformulate existing Document AI tasks, such as Information Extraction (IE), as Question-Answering tasks, making the dataset a suitable resource for training and evaluating Large Language Models. Second, we release the OCR of all the documents and include the exact position of each answer in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (which may include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
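The two ideas in the abstract (recasting an IE field as a question, and optionally exposing bounding boxes in the prompt) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `QARecord` fields, the question template, the `(x0, y0, x1, y1)` box convention, and the prompt layout are hypothetical and are not taken from the released dataset or the authors' code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x0, y0, x1, y1) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class QARecord:
    """Hypothetical QA record; the real dataset schema may differ."""
    question: str
    answer: str
    bbox: Box


def ie_to_qa(field_name: str, value: str, bbox: Box,
             template: str = "What is the {field}?") -> QARecord:
    """Turn one IE annotation (field, value, box) into a QA record."""
    question = template.format(field=field_name.replace("_", " "))
    return QARecord(question=question, answer=value, bbox=bbox)


def build_prompt(ocr_words: List[Tuple[str, Box]], question: str,
                 with_boxes: bool = False) -> str:
    """Build a text prompt from OCR words, optionally appending each box."""
    if with_boxes:
        lines = [f"{text} {list(box)}" for text, box in ocr_words]
    else:
        lines = [text for text, _ in ocr_words]
    return "Document:\n" + "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"


# Made-up example values, used only to exercise the functions.
record = ie_to_qa("invoice_date", "2021-03-14", (120, 48, 210, 62))
ocr = [("Date:", (80, 48, 115, 62)), ("2021-03-14", (120, 48, 210, 62))]
prompt_plain = build_prompt(ocr, record.question)
prompt_boxes = build_prompt(ocr, record.question, with_boxes=True)
```

Comparing model answers on `prompt_plain` and `prompt_boxes` is one simple way to probe whether layout information helps, in the spirit of the prompting comparison the abstract describes.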

