Enabling effective and efficient machine learning (ML) over large-scale graph data (e.g., graphs with billions of edges) can have a huge impact on both industrial and scientific applications. However, community efforts to advance large-scale graph ML have been severely limited by the lack of a suitable public benchmark. For KDD Cup 2021, we present the OGB Large-Scale Challenge (OGB-LSC), a collection of three real-world datasets for advancing the state of the art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks: link prediction, graph regression, and node classification. Furthermore, OGB-LSC provides dedicated baseline experiments that scale up expressive graph ML models to these massive datasets. We show that the expressive models significantly outperform simple scalable baselines, indicating an opportunity for dedicated efforts to further improve graph ML at scale. Our datasets and baseline code are released and maintained as part of our OGB initiative (Hu et al., 2020). We hope OGB-LSC at KDD Cup 2021 can empower the community to discover innovative solutions for large-scale graph ML.