In the wake of information overload in academia, methodologies and systems for search, recommendation, and prediction that help researchers identify relevant research are actively studied and developed. Existing work, however, is limited in granularity: it focuses only on the level of papers or on a single type of artifact, such as data sets. To enable more holistic analyses and systems dealing with academic publications and their content, we propose CoCon, a large scholarly data set reflecting the combined use of research artifacts, contextualized in the full text of academic publications. Our data set comprises 35 k artifacts (data sets, methods, models, and tasks) and 340 k publications. We additionally formalize a link prediction task for "combined research artifact use prediction" and provide code to facilitate analyses of, and the development of ML applications on, our data. All data and code are publicly available at https://github.com/IllDepence/contextgraph.
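To make the "combined research artifact use prediction" task more concrete, the following is a minimal sketch of link prediction on an artifact co-use graph. It is not the method or code from the CoCon repository: the toy `papers` mapping, the function names, and the common-neighbors scoring heuristic are all assumptions chosen for illustration. The idea is that two artifacts not yet used together in any paper are scored by how many co-used artifacts they share.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data (not taken from CoCon): paper id -> artifacts it uses.
papers = {
    "p1": {"BERT", "SQuAD", "question answering"},
    "p2": {"BERT", "GLUE", "fine-tuning"},
    "p3": {"ResNet", "ImageNet", "image classification"},
    "p4": {"fine-tuning", "SQuAD", "question answering"},
}


def build_co_use_graph(papers):
    """Link every pair of artifacts that appear together in some paper."""
    neighbors = defaultdict(set)
    for artifacts in papers.values():
        for a, b in combinations(sorted(artifacts), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def common_neighbor_score(graph, a, b):
    """Number of artifacts that have been co-used with both a and b."""
    return len(graph[a] & graph[b])


def rank_candidate_links(graph):
    """Score all artifact pairs without an edge, highest score first."""
    nodes = sorted(graph)
    candidates = [
        (common_neighbor_score(graph, a, b), a, b)
        for a, b in combinations(nodes, 2)
        if b not in graph[a]  # only pairs not yet used together
    ]
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(candidates, key=lambda t: (-t[0], t[1], t[2]))


graph = build_co_use_graph(papers)
ranking = rank_candidate_links(graph)

for score, a, b in ranking[:3]:
    print(f"{a} + {b}: {score} shared co-used artifacts")
```

On this toy input the highest-ranked unseen pair is GLUE with SQuAD, since both have been used together with BERT and fine-tuning. A learned model over the real graph would replace the heuristic score, but the input (existing co-use edges) and output (ranked candidate pairs) have the same shape.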