We propose a new approach, the calibrated nonparametric scan statistic (CNSS), for more accurate detection of anomalous patterns in large-scale, real-world graphs. Scan statistics identify connected subgraphs that are interesting or unexpected through maximization of a likelihood ratio statistic; in particular, nonparametric scan statistics (NPSSs) identify subgraphs with a higher than expected proportion of individually significant nodes. However, we show that recently proposed NPSS methods are miscalibrated, failing to account for the maximization of the statistic over the multiplicity of subgraphs. This results in both reduced detection power for subtle signals, and low precision of the detected subgraph even for stronger signals. Thus we develop a new statistical approach to recalibrate NPSSs, correctly adjusting for multiple hypothesis testing and taking the underlying graph structure into account. While the recalibration, based on randomization testing, is computationally expensive, we propose both an efficient (approximate) algorithm and new, closed-form lower bounds (on the expected maximum proportion of significant nodes for subgraphs of a given size, under the null hypothesis of no anomalous patterns). These advances, along with the integration of recent core-tree decomposition methods, enable CNSS to scale to large real-world graphs, with substantial improvement in the accuracy of detected subgraphs. Extensive experiments on both semi-synthetic and real-world datasets are demonstrated to validate the effectiveness of our proposed methods, in comparison with state-of-the-art counterparts.
翻译:我们提出一种新的方法,即校准非参数扫描统计(CNSS),以便更准确地检测大规模、真实世界图形中的异常模式。扫描统计通过最大程度的概率统计,确定有趣的或出乎意料的关联子子集;特别是,非参数扫描统计(NPSS),确定低于预期的个别重要节点比例的子集;然而,我们表明,最近提出的核动力源统计方法有误校,没有考虑到对子集统计的最大化。这导致微妙信号的探测能力下降,所探测到的子集甚至更强信号的精确度低。因此,我们制定了一种新的统计方法,以重新调整核动力源SS,对多个假设进行正确调整,并考虑到基本图形结构。虽然根据随机化测试进行的重新校正非常昂贵,但我们建议采用高效(近似)的算法,以及新的、封闭式的下限(预计重大节点的最大比例),用于特定规模的子集成的子集成,在不真实的假设下,在不精确度上,采用不精确度的模型,这些与大规模的模型推算得分级的模型,这些推算得更精确地推算。