Recent years have seen dramatic growth in the volume of research literature, with many new papers published every day, especially in computer science. Selecting the papers worth reading from this mass of literature, whether to survey a research topic quickly or to keep up with its latest advances, has become a challenging task. Existing academic search engines such as Google Scholar return relevant papers by scoring each paper against the query independently. As a result, such systems usually ignore the prerequisite chains of a research topic and cannot form a meaningful reading path. In this paper, we introduce a new task named Reading Path Generation (RPG), which aims to automatically produce a path of papers to read for a given query. To serve as a research benchmark, we further propose SurveyBank, a dataset consisting of a large number of survey papers in computer science together with their citation relationships. Each survey paper is annotated with key phrases extracted from its title and multi-level reading lists inferred from its references. Furthermore, we propose a graph-optimization-based approach to reading path generation that takes the relationships between papers into account. Extensive evaluations demonstrate that our approach outperforms the baselines. We have also implemented a Real-time Reading Path Generation System (RePaGer) based on our model. To the best of our knowledge, we are the first to target this important research problem. The source code of the RePaGer system and the SurveyBank dataset are publicly available.
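To make the notion of a reading path concrete, the following minimal sketch orders query-relevant papers so that each paper appears after the relevant papers it cites. It is an illustration only, not the graph-optimization approach proposed in the paper: the function `reading_path`, the `cites` and `relevance` inputs, the `min_relevance` threshold, and the use of a plain topological sort are all assumptions made for this example.

```python
import heapq


def reading_path(cites, relevance, min_relevance=0.5):
    """Order relevant papers so each one follows the relevant papers it cites.

    cites:     hypothetical dict mapping a paper id to the ids it cites
    relevance: hypothetical dict mapping a paper id to a query relevance score
    """
    # Keep only papers judged relevant enough to the query.
    selected = {p for p, score in relevance.items() if score >= min_relevance}

    # For each selected paper, count its selected prerequisites (cited papers)
    # and record which papers depend on it.
    remaining = {p: 0 for p in selected}
    dependents = {p: [] for p in selected}
    for paper in selected:
        for cited in set(cites.get(paper, [])):
            if cited in selected and cited != paper:
                remaining[paper] += 1
                dependents[cited].append(paper)

    # Kahn's topological sort; among papers whose prerequisites are all read,
    # the most relevant one comes first. Papers caught in a citation cycle
    # never become ready and are left out of the path.
    ready = [(-relevance[p], p) for p in selected if remaining[p] == 0]
    heapq.heapify(ready)
    path = []
    while ready:
        _, paper = heapq.heappop(ready)
        path.append(paper)
        for nxt in dependents[paper]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                heapq.heappush(ready, (-relevance[nxt], nxt))
    return path


# Toy data, invented for illustration.
example_cites = {
    "survey": ["bert", "transformer"],
    "bert": ["transformer"],
    "transformer": [],
    "unrelated": [],
}
example_relevance = {"survey": 0.9, "bert": 0.8, "transformer": 0.7, "unrelated": 0.1}
print(reading_path(example_cites, example_relevance))
# prints ['transformer', 'bert', 'survey']
```

A search engine that scores each paper independently would rank "survey" first here, whereas the path above places the prerequisite papers before the papers that build on them.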