Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about the lack of both in scientific publications, in an effort to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, resulting in a large stream of works. As a consequence, several Graph Neural Network (GNN) models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigor and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided when comparing with the state of the art. To counter this troubling trend, we ran more than 47,000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines, we provide convincing evidence that, on some datasets, structural information has not yet been exploited. We believe this work can contribute to the development of the graph learning field by providing a much-needed grounding for rigorous evaluations of graph classification models.
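To illustrate what a structure-agnostic baseline looks like in practice, the sketch below represents each graph only by an aggregate of its node features (here, a sum), discarding all edge information, and classifies the pooled vectors with a simple nearest-centroid rule. This is a minimal illustration, not the exact baseline used in the paper; the toy data, the sum readout, and the centroid classifier are all assumptions made for the example.

```python
import numpy as np

def readout(node_features: np.ndarray) -> np.ndarray:
    """Structure-agnostic readout: sum node features, ignoring all edges."""
    return node_features.sum(axis=0)

# Hypothetical toy dataset: two classes of graphs whose node-feature
# distributions differ, so the pooled vectors alone are discriminative.
rng = np.random.default_rng(0)
graphs, labels = [], []
for label in (0, 1):
    for _ in range(50):
        n_nodes = rng.integers(5, 15)  # variable graph size
        feats = rng.normal(loc=label, scale=0.5, size=(n_nodes, 4))
        graphs.append(feats)
        labels.append(label)

X = np.stack([readout(g) for g in graphs])  # one pooled vector per graph
y = np.array(labels)

# Nearest-centroid classifier on the pooled representations: if this
# already separates the classes, edge structure was not needed.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X[:, None, :] - centroids) ** 2).sum(axis=-1), axis=1)
accuracy = (pred == y).mean()
```

When such a baseline matches a GNN's accuracy on a benchmark, it suggests the GNN is not exploiting the graph topology on that dataset, which is the comparison the abstract refers to.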