Communities are a common and widely studied structure in networks, typically under the assumption that the network is fully and correctly observed. In practice, network data are often collected by querying nodes about their connections. In some settings, all edges of a sampled node will be recorded, and in others, a node may be asked to name its connections. These sampling mechanisms introduce noise and bias which can obscure the community structure and invalidate assumptions underlying standard community detection methods. We propose a general model for a class of network sampling mechanisms based on recording edges via querying nodes, designed to improve community detection for network data collected in this fashion. We model edge sampling probabilities as a function of both individual preferences and community parameters, and show community detection can be performed by spectral clustering under this general class of models. We also propose, as a special case of the general framework, a parametric model for directed networks we call the nomination stochastic block model, which allows for meaningful parameter interpretations and can be fitted by the method of moments. Both spectral clustering and the method of moments in this case are computationally efficient and come with theoretical guarantees of consistency. We evaluate the proposed model in simulation studies on both unweighted and weighted networks and apply it to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
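As a rough illustration of the setting described above, the following sketch simulates a latent two-block network, lets each node nominate a limited number of its neighbors, and then clusters the resulting directed adjacency matrix spectrally. It is a toy example under stated assumptions, not the nomination stochastic block model or the estimator proposed in the paper: the uniform nomination rule, the cap of 20 nominations, the block probabilities, and all function names are hypothetical choices made only for this illustration.

```python
import numpy as np


def simulate_nominations(n_per_block=200, p_in=0.3, p_out=0.03,
                         max_nominations=20, seed=0):
    """Simulate a latent two-block network observed through node nominations.

    Assumption for illustration only: each node names up to `max_nominations`
    of its latent neighbors uniformly at random.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_per_block)
    n = labels.size

    # Latent undirected two-block stochastic block model.
    probs = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    latent = upper | upper.T

    # Observed directed network: row i holds the nominations made by node i.
    observed = np.zeros((n, n), dtype=float)
    for i in range(n):
        neighbors = np.flatnonzero(latent[i])
        k = min(max_nominations, neighbors.size)
        if k > 0:
            chosen = rng.choice(neighbors, size=k, replace=False)
            observed[i, chosen] = 1.0
    return observed, labels


def spectral_cluster(adjacency, n_clusters=2, n_iter=50):
    """Cluster nodes with k-means on the leading left singular vectors."""
    u, s, _ = np.linalg.svd(adjacency)
    embedding = u[:, :n_clusters] * s[:n_clusters]

    # Deterministic farthest-point initialisation of the cluster centers.
    centers = [embedding[0]]
    for _ in range(1, n_clusters):
        dist_to_nearest = np.min(
            [np.sum((embedding - c) ** 2, axis=1) for c in centers], axis=0
        )
        centers.append(embedding[np.argmax(dist_to_nearest)])
    centers = np.array(centers)

    # Plain Lloyd iterations.
    for _ in range(n_iter):
        dist = ((embedding[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centers[k] = embedding[assign == k].mean(axis=0)
    return assign


def accuracy(pred, truth):
    """Two-cluster accuracy, invariant to swapping the two labels."""
    agree = np.mean(pred == truth)
    return max(agree, 1.0 - agree)


if __name__ == "__main__":
    observed, labels = simulate_nominations()
    pred = spectral_cluster(observed)
    print("accuracy:", accuracy(pred, labels))
```

The left singular vectors are used here so that nodes are grouped by whom they nominate; clustering on the right singular vectors would instead group nodes by who nominates them.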