It is common practice in the outlier mining community to repurpose classification datasets for evaluating detection models. Typically, a binary classification dataset is used: samples from one class are designated as the inliers, while the other class is substantially down-sampled to create the ground-truth outliers. Graph-level outlier detection (GLOD) is rarely studied but has many potentially impactful real-world applications. In this study, we identify an intriguing issue with repurposing graph classification datasets for GLOD. We find that the ROC-AUC performance of detection models changes drastically (flips from high to very low, even worse than random) depending on which class is down-sampled. Interestingly, the ROC-AUCs on the two down-sampled variants approximately sum to 1, and for a certain family of propagation-based outlier detection models the performance gap between the variants is amplified with an increasing number of propagations. By carefully studying the graph embedding space produced by propagation-based models, we identify two driving factors: (1) disparity between within-class densities, which is amplified by propagation, and (2) overlapping support (mixing of embeddings) across classes. We also study other graph embedding methods and downstream outlier detectors, and find that the performance-flip issue still widely exists, although which down-sampled variant achieves higher performance may vary. Thorough analysis of comprehensive results further deepens our understanding of the identified issue.
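The down-sampling protocol and the performance-flip phenomenon described above can be illustrated with a minimal sketch. The scores below are synthetic stand-ins for a detector that systematically ranks one class higher than the other (the class names, sample sizes, and score distributions are illustrative assumptions, not the paper's actual data); ROC-AUC is computed from first principles via the Mann-Whitney statistic so no external library is needed.

```python
import random

def roc_auc(labels, scores):
    """ROC-AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen outlier (label 1) scores above a random inlier (label 0),
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
# Hypothetical detector scores on a binary graph classification dataset:
# the detector happens to score class A consistently higher than class B.
scores_a = [random.gauss(2.0, 1.0) for _ in range(100)]
scores_b = [random.gauss(0.0, 1.0) for _ in range(100)]

# Variant 1: class B kept as inliers, class A down-sampled to 10% as outliers.
down_a = scores_a[:10]
auc_1 = roc_auc([1] * len(down_a) + [0] * len(scores_b), down_a + scores_b)

# Variant 2: roles swapped -- class A kept as inliers, class B down-sampled.
down_b = scores_b[:10]
auc_2 = roc_auc([1] * len(down_b) + [0] * len(scores_a), down_b + scores_a)

print(f"variant 1 AUC: {auc_1:.3f}, variant 2 AUC: {auc_2:.3f}")
```

Because swapping which class plays the outlier role inverts the ranking the detector induces, the same score function yields a high AUC on one variant and a correspondingly low (worse-than-random) AUC on the other, and the two values approximately sum to 1.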