To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).