Alloprof: a new French question-answer education dataset and its use in an information retrieval case study

Teachers and students increasingly rely on online learning resources to supplement those provided in school. The growing breadth and depth of available resources benefits students only if they can find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets for training and evaluating their algorithms, but most of these datasets consist of English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary and high-school help website, containing 29 349 questions and their explanations in a variety of school subjects from 10 368 students, with more than half of the explanations containing links to other questions or to some of the 2 596 reference pages on the website. We also present a case study of this dataset in an information retrieval task. The dataset was collected on the Alloprof public forum, with all questions verified for their appropriateness and the explanations verified both for their appropriateness and for their relevance to the question. To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval and other algorithms specifically for the French-speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols and spelling mistakes will necessitate algorithms based on multimodal comprehension. The case study we present as a baseline shows that an approach relying on recent techniques provides an acceptable level of performance, but more work is necessary before it can be reliably used and trusted in a production setting.

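The retrieval task mentioned in the abstract, matching a student question to relevant reference pages, can be illustrated with a minimal sketch. It is not the authors' method: a bag-of-words count vector stands in for the fine-tuned BERT encoder, and the page identifiers, page texts, question, and the choice of recall@k as the metric are all hypothetical assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, documents):
    """Return document ids sorted from most to least similar to the query."""
    q = embed(query)
    scores = {doc_id: cosine(q, embed(text)) for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


# Hypothetical reference pages and one student question (not from the dataset).
documents = {
    "page-fractions": "comment additionner des fractions avec un dénominateur commun",
    "page-verbes": "la conjugaison des verbes au passé composé",
    "page-cellule": "la cellule animale et ses organites",
}
question = "comment additionner deux fractions"
relevant = {"page-fractions"}

ranked = rank(question, documents)
print(ranked[0], recall_at_k(ranked, relevant, k=1))
```

In a setup closer to the one the abstract describes, `embed` would presumably be replaced by a fine-tuned pre-trained encoder, while the ranking and evaluation steps could stay structurally similar.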