Out-of-distribution (OOD) learning deals with scenarios in which training and test data follow different distributions. Although general OOD problems have been intensively studied in machine learning, graph OOD is only an emerging area of research. Currently, there is no systematic benchmark tailored to the evaluation of graph OOD methods. In this work, we aim to develop an OOD benchmark, known as GOOD, specifically for graphs. We explicitly make distinctions between covariate and concept shifts and design data splits that accurately reflect the different shifts. We consider both graph and node prediction tasks, as there are key differences in designing shifts for each. Overall, GOOD contains 8 datasets with 14 domain selections. When each domain selection is combined with covariate shift, concept shift, and no shift, we obtain 42 different splits. We provide performance results for 7 commonly used baseline methods, each with 10 random runs, yielding 294 dataset-model combinations in total. Our results show significant performance gaps between in-distribution and OOD settings. They also shed light on the different performance trends of covariate and concept shifts across methods. The GOOD benchmark is a growing project and is expected to expand in both the quantity and variety of its resources as the area develops. It can be accessed via $\href{https://github.com/divelab/GOOD/}{\text{https://github.com/divelab/GOOD/}}$.
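For clarity, the counting behind these numbers can be made explicit. The following assumes, consistent with the description above, that each of the 14 domain selections is paired with all three shift settings, and that every resulting split is evaluated with all 7 baseline methods:

$$
\underbrace{14}_{\text{domain selections}} \times \underbrace{3}_{\text{covariate, concept, no shift}} = 42 \;\text{splits}, \qquad 42 \;\text{splits} \times 7 \;\text{baselines} = 294 \;\text{dataset-model combinations}.
$$

Each of the 294 combinations is then run with 10 random seeds to produce the reported results.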
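Since the benchmark is distributed through the GitHub repository above, a minimal usage sketch may help illustrate how a single split is loaded. The module path, class name, and `load` signature below follow the repository README at the time of writing and should be treated as assumptions that may change as the project evolves; the local data path is hypothetical.

```python
# A minimal sketch of loading one GOOD split, assuming the dataset loaders
# documented in https://github.com/divelab/GOOD/ (names may change upstream).
from GOOD.data.good_datasets.good_hiv import GOODHIV

# Load the GOOD-HIV graph prediction dataset with the "scaffold" domain
# selection and a covariate-shift split; `meta_info` describes the split.
dataset, meta_info = GOODHIV.load(
    dataset_root="datasets",   # hypothetical local directory for the data
    domain="scaffold",         # domain selection, e.g. "scaffold" or "size"
    shift="covariate",         # one of "covariate", "concept", "no_shift"
    generate=False,            # use the pre-generated split
)
print(meta_info)
```

Analogous loaders exist for the other datasets and domain selections, so sweeping over all domain/shift combinations reproduces the 42 splits described in the abstract.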