Identifying the number of communities is a fundamental problem in community detection and has received increasing attention in recent years. However, rapid advances in technology have produced large-scale networks across many disciplines, making existing methods computationally infeasible. To address this challenge, we propose a novel subsampling-based modified Bayesian information criterion (SM-BIC) for identifying the number of communities in a network generated by the stochastic block model or the degree-corrected stochastic block model. We first propose a node-pair subsampling method to extract an informative subnetwork from the entire network, and then derive a purely data-driven criterion to identify the number of communities from the subnetwork. In this way, the SM-BIC identifies the number of communities from the subsampled network rather than the entire network, which leads to important computational advantages over existing methods. We theoretically investigate the computational complexity and identification consistency of the SM-BIC. Furthermore, extensive numerical studies demonstrate the advantages of the SM-BIC.
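To make the idea of node-pair subsampling concrete, the following is a minimal Python sketch. It is not the SM-BIC procedure itself: the abstract does not specify the sampling scheme or the criterion, so the sketch assumes pairs are drawn uniformly at random without replacement, and the function names `simulate_sbm` and `subsample_node_pairs` are hypothetical. It only illustrates that a subsample of node pairs can carry information about the full network, here through the overall edge density.

```python
import numpy as np


def simulate_sbm(n, block_probs, rng):
    """Simulate an undirected network from a stochastic block model.

    Labels are drawn uniformly over the K blocks (an assumption made
    for this illustration only).
    """
    k = block_probs.shape[0]
    labels = rng.integers(0, k, size=n)
    # Entry (a, b) is the edge probability between nodes a and b.
    probs = block_probs[labels][:, labels]
    # Draw the strict upper triangle, then mirror it: symmetric, no self-loops.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(np.int8)
    return adj, labels


def subsample_node_pairs(adj, m, rng):
    """Draw m distinct unordered node pairs (i < j) uniformly at random.

    Assumption: uniform sampling without replacement; the paper's actual
    scheme may differ. Enumerating all pairs costs O(n^2) memory, which is
    fine for a toy example but not for a genuinely large network.
    """
    n = adj.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    chosen = rng.choice(rows.size, size=m, replace=False)
    i, j = rows[chosen], cols[chosen]
    return i, j, adj[i, j]


rng = np.random.default_rng(0)
B = np.array([[0.30, 0.05],
              [0.05, 0.30]])  # hypothetical two-block connectivity matrix
adj, labels = simulate_sbm(200, B, rng)
i, j, edges = subsample_node_pairs(adj, 2000, rng)

# Compare the edge density of the subsample with that of the full network.
full_density = adj[np.triu_indices(200, k=1)].mean()
print(f"subsample density: {edges.mean():.3f}, full density: {full_density:.3f}")
```

In this toy setting the subsample covers roughly a tenth of the 19,900 node pairs. Any model-selection criterion computed from `i`, `j`, and `edges` would touch only those pairs, which is the source of the computational savings the abstract describes.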