Recent analyses of self-supervised learning (SSL) find the following data-centric properties to be critical for learning good representations: invariance to task-irrelevant semantics, separability of classes in some latent space, and recoverability of labels from augmented samples. However, given their discrete, non-Euclidean nature, graph datasets and graph SSL methods are unlikely to satisfy these properties. This raises the question: how do graph SSL methods, such as contrastive learning (CL), work well? To systematically probe this question, we perform a generalization analysis for CL when using generic graph augmentations (GGAs), with a focus on data-centric properties. Our analysis yields formal insights into the limitations of GGAs and the necessity of task-relevant augmentations. As we empirically show, GGAs do not induce task-relevant invariances on common benchmark datasets, leading to only marginal gains over naive, untrained baselines. Our theory motivates a synthetic data generation process that enables control over task-relevant information and boasts pre-defined optimal augmentations. This flexible benchmark helps us identify yet unrecognized limitations in advanced augmentation techniques (e.g., automated methods). Overall, our work rigorously contextualizes, both empirically and theoretically, the effects of data-centric properties on augmentation strategies and learning paradigms for graph SSL.
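For concreteness, below is a minimal sketch of two generic graph augmentations (GGAs) of the kind analyzed here: edge dropping and node-feature masking. The function names, the `[2, num_edges]` edge-index layout, and the drop probabilities are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of two generic graph augmentations (GGAs).
# Hyperparameters and names are illustrative assumptions.
import torch

def drop_edges(edge_index: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Randomly drop a fraction p of edges.

    edge_index: [2, num_edges] tensor of (source, target) node indices.
    """
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]

def mask_node_features(x: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Independently zero out each node-feature entry with probability p.

    x: [num_nodes, num_features] node-feature matrix.
    """
    mask = (torch.rand_like(x) >= p).to(x.dtype)
    return x * mask
```

Note that both augmentations are label-agnostic: they perturb structure and features uniformly at random, which is precisely why, per the analysis above, they need not induce task-relevant invariances.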
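The contrastive objective such methods optimize can likewise be sketched. Below is a simplified, cross-view-only variant of the InfoNCE (NT-Xent) loss, where `z1[i]` and `z2[i]` are encoder embeddings of two augmented views of the same graph; the temperature value and function name are assumptions for illustration.

```python
# Simplified cross-view InfoNCE loss (an assumption-laden sketch,
# not the paper's exact objective).
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Each graph's two views are positives; all other graphs in the
    batch serve as negatives (cross-view similarities only)."""
    z1 = F.normalize(z1, dim=1)  # project embeddings onto the unit sphere
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                             # [B, B] cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```

Training a GNN encoder with this loss on views produced by GGAs like those sketched earlier gives the CL pipeline whose generalization behavior the analysis characterizes.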