Out-of-distribution (OOD) learning deals with scenarios in which training and test data follow different distributions. Although general OOD problems have been intensively studied in machine learning, graph OOD remains an emerging area of research. Currently, a systematic benchmark tailored to the evaluation of graph OOD methods is lacking. In this work, we aim to develop an OOD benchmark, known as GOOD, specifically for graphs. We explicitly distinguish between covariate and concept shifts and design data splits that accurately reflect these different shifts. We consider both graph and node prediction tasks, as there are key differences in designing shifts for each. Overall, GOOD contains 11 datasets with 17 domain selections. Combined with covariate, concept, and no-shift conditions, these yield 51 different splits. We provide performance results for 10 commonly used baseline methods, each with 10 random runs, giving 510 dataset-model combinations in total. Our results show significant performance gaps between in-distribution and OOD settings, and they also shed light on how different methods exhibit different performance trends under covariate versus concept shifts. The GOOD benchmark is a growing project and is expected to expand in both the quantity and variety of its resources as the area develops. The GOOD benchmark can be accessed via https://github.com/divelab/GOOD/.
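To make the split structure concrete, the sketch below shows how one of the 51 splits (a dataset, a domain selection, and a shift condition) might be selected in practice. It is a minimal, hypothetical usage example: the module path, the `GOODHIV` class, the `load` signature, and the `dataset_root`, `domain`, and `shift` arguments are assumptions for illustration and may differ from the actual GOOD API; consult the repository for the authoritative interface.

```python
# Hypothetical sketch of selecting a GOOD split; names and signatures are
# assumptions rather than the confirmed GOOD API -- see the repository docs.
from GOOD.data.good_datasets.good_hiv import GOODHIV  # assumed module path

# One domain selection (e.g., 'scaffold') combined with one shift condition
# ('covariate', 'concept', or 'no_shift') identifies one of the 51 splits
# described in the paper.
dataset, meta_info = GOODHIV.load(
    dataset_root='datasets',   # assumed local storage path
    domain='scaffold',         # domain selection (chosen for this example)
    shift='covariate',         # covariate, concept, or no shift
)
print(meta_info)
```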