With rapid developments of information and technology, large scale network data are ubiquitous. In this work we develop a distributed spectral clustering algorithm for community detection in large scale networks. To handle the problem, we distribute l pilot network nodes on the master server and the others on worker servers. A spectral clustering algorithm is first conducted on the master to select pseudo centers. The indexes of the pseudo centers are then broadcasted to workers to complete distributed community detection task using a SVD type algorithm. The proposed distributed algorithm has three merits. First, the communication cost is low since only the indexes of pseudo centers are communicated. Second, no further iteration algorithm is needed on workers and hence it does not suffer from problems as initialization and non-robustness. Third, both the computational complexity and the storage requirements are much lower compared to using the whole adjacency matrix. A Python package DCD (www.github.com/Ikerlz/dcd) is developed to implement the distributed algorithm for a Spark system. Theoretical properties are provided with respect to the estimation accuracy and mis-clustering rates. Lastly, the advantages of the proposed methodology are illustrated by experiments on a variety of synthetic and empirical datasets.
翻译:随着信息和技术的迅速发展,大规模网络数据无处不在。在这项工作中,我们为大型网络的社区探测开发了分布式光谱聚变算法。为了解决问题,我们在主服务器上和工人服务器上分配了一个试点网络节点。首先在主服务器上进行光谱聚变算法,以选择假中心。然后向工人播放伪中心索引,以便使用SVD型算法完成分布式社区检测任务。拟议的分布式算法有三长优点。首先,通信成本较低,因为只传播伪中心索引。第二,不需要工人的进一步循环算法,因此,它不会在初始化和非机器人状态上遇到问题。第三,计算复杂程度和储存要求都比使用整个相近矩阵要低得多。开发了Python软件包DCD(www.github.com/Ikerlz/dcd),以实施Spark系统的分布式算法。提供了理论属性,关于估计和误归集率的准确性和误归集率。最后,通过综合数据实验展示了拟议方法的优势。