Businesses generate thousands of documents that communicate their strategic vision and provide details of key products, services, entities, and processes. Knowledge workers then face the laborious task of reading these documents to identify, extract, and synthesize information relevant to their organizational goals. To automate information gathering, question answering (QA) offers a flexible framework in which human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question, and answer). However, data curation for document QA is uniquely challenging because the context (i.e., the answer evidence passage) must be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDFs; (2) evidence retrieval from the extracted texts to form well-posed contexts; (3) QA over those contexts to return high-quality answers: extractive, abstractive, or Boolean. Using QASPER as a surrogate for our proprietary data, our detect-retrieve-comprehend (DRC) system achieves a +6.25 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical document QA.
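To make the three-stage design concrete, the following is a minimal Python sketch of a detect-retrieve-comprehend style pipeline. It is illustrative only: pypdf page extraction, TF-IDF passage ranking, and an off-the-shelf extractive reader stand in for DRC's actual components, and the file name report.pdf and the example question are hypothetical.

```python
# Illustrative detect-retrieve-comprehend sketch. The components below
# (pypdf, TF-IDF ranking, a generic SQuAD-tuned reader) are stand-ins,
# not the models used by DRC itself.
from pypdf import PdfReader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

def detect(pdf_path):
    """Stage 1: extract candidate passages (here, one per page) from a PDF."""
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]

def retrieve(question, passages, top_k=3):
    """Stage 2: rank passages against the question; keep the top few as context."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(passages + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return "\n".join(passages[i] for i in scores.argsort()[::-1][:top_k])

def comprehend(question, context):
    """Stage 3: read the assembled context and return an extractive answer."""
    reader = pipeline("question-answering", model="deepset/roberta-base-squad2")
    return reader(question=question, context=context)["answer"]

if __name__ == "__main__":
    question = "What products does the document describe?"  # hypothetical query
    passages = detect("report.pdf")                          # hypothetical file
    print(comprehend(question, retrieve(question, passages)))
```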