BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

We present a unified dataset for document Question Answering (QA), obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our contribution is twofold. First, we reformulate existing Document AI tasks, such as Information Extraction (IE), as Question Answering tasks, which makes the dataset a suitable resource for training and evaluating Large Language Models. Second, we release the OCR output for all documents and include the exact position of each answer in the document image as a bounding box. Using this dataset, we explore how different prompting techniques (some of which include bounding box information) affect the performance of open-weight models, and we identify the most effective approaches for document comprehension.
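The reformulation of an IE annotation into a QA record with a spatial annotation can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the dataset's actual schema or the authors' code: the `Word` type, the `ie_to_qa` helper, the question template, the record keys, and the choice to normalise coordinates to a 0..1000 integer range are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; assumed layout


@dataclass
class Word:
    """One OCR token with its pixel bounding box (hypothetical type)."""
    text: str
    box: Box


def union_box(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses every box in `boxes`."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def ie_to_qa(field_name: str, answer_words: List[Word],
             page_width: int, page_height: int) -> Dict[str, object]:
    """Turn one IE field annotation into a QA record (hypothetical schema)."""
    x0, y0, x1, y1 = union_box([w.box for w in answer_words])
    # Assumption: coordinates are scaled to 0..1000 so that they do not
    # depend on the image resolution.
    bbox = [
        x0 * 1000 // page_width,
        y0 * 1000 // page_height,
        x1 * 1000 // page_width,
        y1 * 1000 // page_height,
    ]
    return {
        "question": f"What is the {field_name.replace('_', ' ')}?",
        "answer": " ".join(w.text for w in answer_words),
        "bbox": bbox,
    }
```

For example, calling `ie_to_qa("vendor_name", words, page_width, page_height)` with the OCR words that make up the annotated value yields a question, the answer string, and a single box covering the answer, which could then be placed in a prompt with or without the box.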

