Teachers and students are increasingly relying on online learning resources to supplement those provided in school. This increase in the breadth and depth of available resources is a great thing for students, but only if they are able to find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets to train and evaluate their algorithms, but most of these datasets have consisted of English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary- and high-school help website, containing 29 349 questions and their explanations in a variety of school subjects from 10 368 students, with more than half of the explanations containing links to other questions or to some of the 2 596 reference pages on the website. We also present a case study of this dataset in an information retrieval task. The dataset was collected on the Alloprof public forum, with all questions verified for their appropriateness and all explanations verified both for their appropriateness and for their relevance to the question. To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval, and other algorithms specifically for the French-speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols, and spelling mistakes will require algorithms built on multimodal comprehension. The case study we present as a baseline shows that an approach relying on recent techniques provides an acceptable level of performance, but more work is necessary before it can be reliably used and trusted in a production setting.
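The retrieval baseline is only summarized above. As a rough illustration of how such a baseline can be built, the sketch below fine-tunes a pre-trained French BERT checkpoint as a binary relevance classifier over (question, reference-page passage) pairs using Hugging Face Transformers. The camembert-base checkpoint, the dataset class, the hyperparameters, and the training loop are all illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a BERT-based relevance baseline.
# Assumptions: CamemBERT checkpoint, binary relevance labels, PyTorch + Transformers.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "camembert-base"  # any French or multilingual BERT could be swapped in

class QuestionPagePairs(Dataset):
    """(question, reference-page passage, label) triples; label 1 = relevant."""
    def __init__(self, pairs, tokenizer, max_len=256):
        self.pairs, self.tok, self.max_len = pairs, tokenizer, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        question, passage, label = self.pairs[i]
        enc = self.tok(question, passage, truncation=True,
                       max_length=self.max_len, padding="max_length",
                       return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(label)
        return item

def finetune(pairs, epochs=2, lr=2e-5, batch_size=16):
    """Fine-tune a sequence-classification head to score question/page relevance."""
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    loader = DataLoader(QuestionPagePairs(pairs, tok),
                        batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            optim.zero_grad()
            loss = model(**batch).loss  # cross-entropy over relevant / not relevant
            loss.backward()
            optim.step()
    return tok, model
```

At inference time, candidate reference pages can be ranked for a given question by the classifier's probability for the "relevant" class, which is one common way to turn such a fine-tuned model into a retrieval baseline.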