Network-based clustering methods frequently require the number of communities to be specified \emph{a priori}. Moreover, most existing methods for estimating the number of communities assume that this number is fixed and does not scale with the network size $n$. The few methods that allow the number of communities to increase with $n$ are valid only when the average degree $d$ of the network grows at least as fast as $O(n)$ (i.e., the dense case) or lies within a narrow range. This presents a challenge in clustering large-scale network data, particularly when the average degree $d$ grows more slowly than $O(n)$ (i.e., the sparse case). To address this problem, we propose a new sequential procedure that uses multiple hypothesis tests and the spectral properties of Erd\H{o}s--R\'{e}nyi graphs to estimate the number of communities in sparse stochastic block models (SBMs). We prove the consistency of our method for sparse SBMs over a broad range of the sparsity parameter. As a consequence, we find that our method can estimate the number of communities $K^{(n)}_{\star}$ when $K^{(n)}_{\star}$ increases at a rate as high as $O(n^{(1 - 3\gamma)/(4 - 3\gamma)})$, where $d = O(n^{1 - \gamma})$. Moreover, we show that our method can be adapted as a stopping rule for estimating the number of communities in binary tree stochastic block models. We benchmark the performance of our method against competing methods on six reference single-cell RNA-sequencing datasets. Finally, we demonstrate the usefulness of our method through numerical simulations and by using it to cluster real single-cell RNA-sequencing datasets.
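As a quick reading of the stated rate (a worked evaluation of the exponent taken at face value, not an additional result), write $\alpha(\gamma) = (1 - 3\gamma)/(4 - 3\gamma)$, where the symbol $\alpha$ is introduced here purely for illustration:

\[
\alpha(0) = \frac{1}{4}, \qquad \alpha\!\left(\tfrac{1}{4}\right) = \frac{1/4}{13/4} = \frac{1}{13}, \qquad \alpha(\gamma) \to 0 \ \text{as}\ \gamma \to \tfrac{1}{3}.
\]

So when $d = O(n)$ the formula allows up to order $n^{1/4}$ communities, the admissible growth shrinks as the network becomes sparser, and the exponent is positive only for $\gamma < 1/3$.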