Understanding generalization and robustness of machine learning models fundamentally relies on assuming an appropriate metric on the data space. Identifying such a metric is particularly challenging for non-Euclidean data such as graphs. Here, we propose a pseudometric for attributed graphs, the Tree Mover's Distance (TMD), and study its relation to generalization. Via a hierarchical optimal transport problem, TMD reflects the local distribution of node attributes as well as the distribution of local computation trees, which are known to be decisive for the learning behavior of graph neural networks (GNNs). First, we show that TMD captures properties relevant to graph classification: a simple TMD-SVM performs competitively with standard GNNs. Second, we relate TMD to generalization of GNNs under distribution shifts, and show that it correlates well with performance drop under such shifts.
翻译:机器学习模型的一般化和稳健性基本上取决于对数据空间的适当度量的假设。 确定这种度量对于图类等非欧裔数据来说特别具有挑战性。 在这里,我们建议了一种配给图的伪度,即树移动器距离(TMD),并研究其与一般化的关系。 通过一个等级最优的运输问题, TMD反映了节点属性的当地分布以及当地计算树的分布,据知这些分布对于图形神经网络(GNNs)的学习行为具有决定性意义。 首先,我们证明TMD捕获了与图表分类有关的属性:一个简单的 TMD-SVM 与标准GNS 具有竞争力的特性。 其次,我们将TMD与分布变换中的GNNs的一般化联系起来,并表明它与这种变换下的性能下降密切相关。