Question answering (QA) models often rely on large-scale training datasets, which necessitates the development of a data generation framework to reduce the cost of manual annotations. Although several recent studies have aimed to generate synthetic questions with single-span answers, no study has been conducted on the creation of list questions with multiple, non-contiguous spans as answers. To address this gap, we propose LIQUID, an automated framework for generating list QA datasets from unlabeled corpora. We first convert a passage from Wikipedia or PubMed into a summary and extract named entities from the summarized text as candidate answers. This allows us to select answers that are semantically correlated in context and are therefore suitable for constructing list questions. We then create questions using an off-the-shelf question generator with the extracted entities and the original passage. Finally, iterative filtering and answer expansion are performed to ensure the accuracy and completeness of the answers. Using our synthetic data, we significantly improve the performance of the previous best list QA models, with gains in exact-match F1 of 5.0 on MultiSpanQA, 1.9 on Quoref, and 2.8 averaged across three BioASQ benchmarks.
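The four-stage pipeline described in the abstract (summarization, candidate-answer extraction, question generation, then filtering and answer expansion) can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the summarizer, entity extractor, question generator, and filter are toy stand-ins for the off-the-shelf neural models LIQUID actually uses, and all function names here are hypothetical.

```python
# Toy sketch of a LIQUID-style list-QA data generation pipeline.
# Each stage below is a hypothetical stand-in for a neural model.

def summarize(passage: str) -> str:
    # Stand-in summarizer: keep only the first sentence.
    # LIQUID uses a trained abstractive summarization model.
    return passage.split(". ")[0] + "."

def extract_entities(text: str) -> list[str]:
    # Stand-in NER: treat capitalized tokens as named entities.
    # Entities drawn from the summary tend to be semantically
    # correlated, which suits multi-answer list questions.
    return [tok.strip(",.") for tok in text.split() if tok[:1].isupper()]

def generate_question(answers: list[str], passage: str) -> str:
    # Stand-in for an off-the-shelf question generator conditioned
    # on the candidate answers and the original passage.
    return f"Which entities does the passage list? ({len(answers)} expected)"

def filter_and_expand(question: str, answers: list[str],
                      passage: str) -> list[str]:
    # Stand-in for iterative filtering (drop answers a QA model would
    # not predict) and answer expansion (add high-confidence extra
    # predictions, omitted in this toy). Here we simply keep answers
    # that actually appear in the passage.
    return [a for a in answers if a in passage]

def build_example(passage: str) -> dict:
    summary = summarize(passage)
    candidates = extract_entities(summary)
    question = generate_question(candidates, passage)
    answers = filter_and_expand(question, candidates, passage)
    return {"question": question, "answers": answers, "context": passage}

passage = ("Seoul, Tokyo, and Osaka are among the largest cities there. "
           "Each has millions of residents.")
example = build_example(passage)
print(example["answers"])  # → ['Seoul', 'Tokyo', 'Osaka']
```

In the real framework each stand-in would be replaced by a pretrained model, and the filter/expand step would iterate: a reading-comprehension model re-answers the generated question, low-confidence candidate answers are dropped, and confident new predictions are added to the answer list.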