Recently proposed systems for open-domain question answering (OpenQA) require large amounts of training data to achieve state-of-the-art performance. However, data annotation is time-consuming and therefore expensive. As a result, suitable datasets are available for only a handful of languages (mainly English and Chinese). In this work, we introduce and publicly release PolQA, the first Polish dataset for OpenQA. It consists of 7,000 questions, 87,525 manually labeled evidence passages, and a corpus of over 7,097,322 candidate passages. Each question is classified according to its formulation, its type, and the entity type of its answer. This resource allows us to evaluate the impact of different annotation choices on the performance of the QA system and to propose an efficient annotation strategy that increases passage retrieval performance by 10.55 p.p. while reducing the annotation cost by 82%.