Although numerous sequential algorithms have been developed to estimate community structure in networks, there is little guidance on, or study of, which significance level or stopping parameter to use in these sequential testing procedures. Most algorithms rely on prespecifying the number of communities or use an arbitrary stopping rule. We provide a principled approach to selecting a nominal significance level for sequential community detection procedures by controlling the tolerance ratio, defined as the ratio of the probability of underfitting to the probability of overfitting the number of clusters when fitting a network. We introduce an algorithm that derives this significance level from a user-specified tolerance ratio, and illustrate it with a sequential modularity maximization approach in a stochastic block model framework. We evaluate the performance of the proposed algorithm through extensive simulations and demonstrate its utility in controlling the tolerance ratio in two applications: clustering single-cell RNA sequencing data by cell type and clustering a congressional voting network.
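To make the tolerance ratio concrete, the following minimal sketch estimates it empirically from repeated fits in which the true number of communities is known, as in a simulation study. The function name `empirical_tolerance_ratio`, its arguments, and the example data are hypothetical and not taken from the paper; the sketch also assumes the ratio is taken as underfitting over overfitting, and it does not reproduce the proposed algorithm for choosing the significance level.

```python
from typing import Sequence


def empirical_tolerance_ratio(estimated_k: Sequence[int], true_k: int) -> float:
    """Empirical ratio of underfitting to overfitting frequency.

    Assumption (not from the paper): underfitting means an estimate below
    true_k, overfitting means an estimate above true_k, and the ratio is
    underfitting divided by overfitting.
    """
    if len(estimated_k) == 0:
        raise ValueError("estimated_k must contain at least one estimate")
    under = sum(1 for k in estimated_k if k < true_k)  # runs that underfit
    over = sum(1 for k in estimated_k if k > true_k)   # runs that overfit
    if over == 0:
        # The ratio is undefined when no run overfits.
        return float("inf") if under > 0 else float("nan")
    # Both counts share the same denominator, so the ratio of counts
    # equals the ratio of empirical probabilities.
    return under / over


# Hypothetical estimates of the number of communities from 10 simulated networks.
estimates = [2, 3, 3, 3, 4, 3, 2, 3, 5, 3]
ratio = empirical_tolerance_ratio(estimates, true_k=3)
```

In this illustration, two runs underfit and two overfit, so the empirical ratio is 1; a ratio above 1 would indicate that the procedure underfits more often than it overfits at the chosen significance level.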