Question answering (QA) models often rely on large-scale training datasets, which necessitates the development of a data generation framework to reduce the cost of manual annotations. Although several recent studies have aimed to generate synthetic questions with single-span answers, no study has been conducted on the creation of list questions with multiple, non-contiguous spans as answers. To address this gap, we propose \ours, an automated framework for generating list QA datasets from unlabeled corpora. We first convert a passage from Wikipedia or PubMed into a summary and extract named entities from the summarized text as candidate answers. This allows us to select answers that are semantically correlated in context, making them suitable for constructing list questions. We then create questions using an off-the-shelf question generator with the extracted entities and the original passage. Finally, iterative filtering and answer expansion are performed to ensure the accuracy and completeness of the answers. Using our synthetic data, we significantly improve the performance of the previous best list QA models by exact-match F1 scores of 5.0 on MultiSpanQA, 1.9 on Quoref, and 2.8 averaged across three BioASQ benchmarks.
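The pipeline described above (summarize, extract candidate answers, generate a question, then filter and expand the answer set) can be sketched as follows. This is a minimal illustrative sketch: every helper here is a toy stand-in, since the framework itself relies on real summarization, named-entity recognition, and question-generation models not specified in this excerpt.

```python
import re


def summarize(passage: str) -> str:
    # Toy stand-in for a summarizer: keep only the first sentence.
    return passage.split(". ")[0]


def extract_entities(text: str) -> list[str]:
    # Toy stand-in for NER: treat capitalized tokens as candidate answers.
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def generate_question(entities: list[str], passage: str) -> str:
    # Toy stand-in for an off-the-shelf question generator.
    return f"Which entities are mentioned? ({len(entities)} answers expected)"


def filter_answers(entities: list[str], passage: str) -> list[str]:
    # Filtering step: keep only candidates attested in the original passage.
    # A real system would also iterate and expand with missed answer spans.
    return [e for e in entities if e in passage]


def build_example(passage: str) -> dict:
    # End-to-end: summary -> candidate answers -> question -> filtered answers.
    summary = summarize(passage)
    candidates = extract_entities(summary)
    question = generate_question(candidates, passage)
    answers = filter_answers(candidates, passage)
    return {"question": question, "answers": answers}


example = build_example("Paris and Lyon are cities in France. They are large.")
print(example["answers"])  # ['Paris', 'Lyon', 'France']
```

Extracting candidates from the summary rather than the full passage is the key design choice: entities that survive summarization tend to be the ones that are semantically central and mutually related, which is what a list question needs.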