Effective human learning depends on a wide selection of educational materials that align with the learner's current understanding of the topic. While the Internet has revolutionized education, a substantial barrier to resource accessibility remains: the sheer volume of online information makes it difficult to navigate and to discover high-quality learning materials. In this paper, we propose the educational resource discovery (ERD) pipeline, which automates web resource discovery for novel domains. The pipeline consists of three main steps: data collection, feature extraction, and resource classification. We start from a known source domain and conduct resource discovery on two unseen target domains via transfer learning. We first collect frequent queries from a set of seed documents and search the web with them to obtain candidate resources, such as lecture slides and introductory blog posts. We then introduce a novel pretrained information retrieval deep neural network model, query-document masked language modeling (QD-MLM), to extract deep features of these candidate resources. Finally, we apply a tree-based classifier to decide whether each candidate is a positive learning resource. The pipeline achieves F1 scores of 0.94 and 0.82 when evaluated on two similar but novel target domains. We also demonstrate how the pipeline can benefit a downstream application: leading paragraph generation for surveys. To the best of our knowledge, this is the first study that considers various web resources for survey generation. We also release a corpus of 39,728 manually labeled web resources and 659 queries from NLP, Computer Vision (CV), and Statistics (STATS).
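As a rough illustration of the first and last stages described above, the following minimal Python sketch collects frequent queries from seed documents and computes the F1 score used for evaluation. It is not the authors' implementation: the abstract does not specify how frequent queries are extracted, so counting word bigrams is an assumption made here, and the function names `frequent_queries` and `f1_score` are hypothetical. The QD-MLM feature extractor and the tree-based classifier are not sketched.

```python
import re
from collections import Counter


def frequent_queries(seed_docs, n=2, top_k=3):
    """Return the top_k most frequent word n-grams across the seed documents.

    Assumption: n-gram counting stands in for the paper's (unspecified)
    query collection step.
    """
    counts = Counter()
    for doc in seed_docs:
        # Lowercase and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Count every contiguous window of n tokens as a candidate query.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return [query for query, _ in counts.most_common(top_k)]


def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for label 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    seeds = [
        "Neural machine translation uses attention.",
        "Attention improves neural machine translation.",
        "Machine translation is hard.",
    ]
    print(frequent_queries(seeds, n=2, top_k=2))
    print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a full pipeline, each extracted query would be issued to a web search engine to gather candidate resources, and the F1 function would be applied to the classifier's predictions against the manual labels on each target domain.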