When students make a mistake on an exercise, they can consolidate their understanding by practicing ``similar exercises'' that share the same concepts, purposes and methods. For a given subject and study stage, an exercise bank commonly contains millions or even tens of millions of items, so finding similar exercises for a given exercise becomes a crucial technical problem. One option is to assign a variety of explicit labels to each exercise and then query by label, but label annotation is time-consuming, laborious and costly, and its precision and granularity are limited, so this approach is not feasible. In practice, we formulate the task as a retrieval process that finds a set of similar exercises through recall, ranking and re-ranking stages; we call this the \textbf{FSE} (Finding Similar Exercises) problem. We obtain a comprehensive representation of the semantic information of exercises through representation learning. Beyond a reasonable architecture, we also explore which pre-training and supervised learning tasks are most conducive to learning exercise semantics. Because similar exercises are difficult to annotate and annotation consistency among experts is low, this paper also provides solutions to the problem of low-quality annotated data. Compared with other methods, our approach has clear advantages in both architectural soundness and algorithmic precision, and it now serves the daily teaching of hundreds of schools.
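As an illustration of the recall, ranking and re-ranking stages named in the abstract, the following is a minimal sketch and not the paper's implementation. Everything in it is an assumption made for illustration: the `Exercise` fields, the use of cosine similarity over precomputed embeddings for recall, token overlap as the finer ranking signal, the equal 0.5 weights, and the near-duplicate filter with its 0.999 threshold used as the re-ranking rule.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Exercise:
    ex_id: str
    text: str
    embedding: tuple  # hypothetical learned representation of the exercise


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall(query, bank, k):
    """Stage 1 (assumed): cheap top-k candidate selection by embedding similarity."""
    candidates = [e for e in bank if e.ex_id != query.ex_id]
    candidates.sort(key=lambda e: cosine(query.embedding, e.embedding), reverse=True)
    return candidates[:k]


def rank(query, candidates):
    """Stage 2 (assumed): finer score mixing cosine with token Jaccard overlap."""
    q_tokens = set(query.text.split())

    def score(e):
        tokens = set(e.text.split())
        union = q_tokens | tokens
        jaccard = len(q_tokens & tokens) / len(union) if union else 0.0
        return 0.5 * cosine(query.embedding, e.embedding) + 0.5 * jaccard

    return sorted(candidates, key=score, reverse=True)


def rerank(ranked, n):
    """Stage 3 (assumed): keep at most n results, skipping near-duplicate embeddings."""
    kept = []
    for e in ranked:
        if all(cosine(e.embedding, other.embedding) < 0.999 for other in kept):
            kept.append(e)
        if len(kept) == n:
            break
    return kept


def find_similar(query, bank, k=100, n=5):
    """Run the three stages in order and return the final exercise list."""
    return rerank(rank(query, recall(query, bank, k)), n)
```

In a bank of millions of exercises, the linear scan in `recall` would normally be replaced by an approximate nearest-neighbor index; the scan is kept here only so the sketch stays self-contained.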