Network data is prevalent in many contemporary big data applications, where a common interest is to unveil important latent links between pairs of nodes. Yet the fundamental question of how to precisely quantify the statistical uncertainty associated with identifying latent links remains largely unexplored. In this paper, we propose the method of statistical inference on membership profiles in large networks (SIMPLE) in the setting of the degree-corrected mixed membership model, where the null hypothesis is that a given pair of nodes shares the same profile of community memberships. In the simpler case of no degree heterogeneity, the model reduces to the mixed membership model, for which an alternative, more robust test is also proposed. Both tests are Hotelling-type statistics based on the rows of empirical eigenvectors or their ratios, whose asymptotic covariance matrices are very challenging to derive and estimate. Nevertheless, we derive their analytical expressions and consistently estimate the unknown covariance matrices. Under some mild regularity conditions, we establish the exact limiting distributions of the two forms of SIMPLE test statistics under the null hypothesis and under contiguous alternative hypotheses: chi-square and noncentral chi-square distributions, respectively, with degrees of freedom depending on whether the degrees are corrected or not. We also address the important issue of estimating the unknown number of communities and establish the asymptotic properties of the associated test statistics. The advantages and practical utility of our new procedures, in terms of both size and power, are demonstrated through several simulation examples and real network applications.
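As a rough illustration of what a Hotelling-type statistic built from rows of empirical eigenvectors looks like, the following minimal sketch may help. It is not the SIMPLE procedure itself: the function names `top_k_eigenvectors` and `hotelling_type_statistic` are hypothetical, selecting eigenvectors by largest eigenvalue magnitude is an assumption, and the covariance matrix `cov` is taken as given, whereas estimating it consistently is the main technical contribution described above.

```python
import numpy as np


def top_k_eigenvectors(adjacency, k):
    """Return an (n, k) matrix whose columns are the eigenvectors of a
    symmetric adjacency matrix with the k largest eigenvalues in magnitude.

    Assumption: ranking by absolute eigenvalue is an illustrative choice.
    """
    a = np.asarray(adjacency, dtype=float)
    vals, vecs = np.linalg.eigh(a)           # eigenvalues in ascending order
    order = np.argsort(-np.abs(vals))[:k]    # indices of the k largest |eigenvalue|
    return vecs[:, order]


def hotelling_type_statistic(row_i, row_j, cov):
    """Quadratic form d^T cov^{-1} d with d = row_i - row_j.

    `cov` is a placeholder for an estimate of the covariance of d; this
    sketch does not reproduce the estimator developed in the paper.
    """
    d = np.asarray(row_i, dtype=float) - np.asarray(row_j, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Solve cov @ x = d rather than forming the explicit inverse.
    return float(d @ np.linalg.solve(cov, d))


if __name__ == "__main__":
    # Toy symmetric matrix standing in for an observed adjacency matrix.
    a = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    v = top_k_eigenvectors(a, k=2)
    # Identity covariance is used purely as a stand-in here.
    print(hotelling_type_statistic(v[0], v[1], np.eye(2)))
```

In an actual test, the resulting value would be compared with a quantile of the chi-square limiting distribution, with the degrees of freedom chosen according to whether degree correction is used.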