Random-walk based network embedding algorithms like node2vec and DeepWalk are widely used to obtain Euclidean representation of the nodes in a network prior to performing down-stream network inference tasks. Nevertheless, despite their impressive empirical performance, there is a lack of theoretical results explaining their behavior. In this paper we studied the node2vec and DeepWalk algorithms through the perspective of matrix factorization. We analyze these algorithms in the setting of community detection for stochastic blockmodel graphs; in particular we established large-sample error bounds and prove consistent community recovery of node2vec/DeepWalk embedding followed by k-means clustering. Our theoretical results indicate a subtle interplay between the sparsity of the observed networks, the window sizes of the random walks, and the convergence rates of the node2vec/DeepWalk embedding toward the embedding of the true but unknown edge probabilities matrix. More specifically, as the network becomes sparser, our results suggest using larger window sizes, or equivalently, taking longer random walks, in order to attain better convergence rate for the resulting embeddings. The paper includes numerical experiments corroborating these observations.
翻译:以随机行道为基础的网络嵌入算法,如 node2vec 和 DeepWalk 被广泛用于获取网络中节点在下游网络推导任务之前的 Euclide 代表。 然而,尽管它们的实验性表现令人印象深刻,但缺乏解释其行为的理论结果。 在本文中,我们从矩阵系数化的角度研究了节点2vec 和 DeepWalk 算法。 我们在为随机区块模型图设定社区检测时分析了这些算法; 特别是,我们建立了大缩放错误界限,并证明在网络中持续恢复了节点2vec/DeepWalk在K- means群集之后的节点。 我们的理论结果显示,观测网络的宽度、随机行走的窗口大小以及Nde2vec/DeepWalk 嵌入真实但未知边缘概率矩阵的聚合率。 更具体地说,当网络变得稀疏时,我们的结果显示,我们使用更大的窗口大小或相当的随机行距进行社区恢复。 为了进行更精确的观测, 从而实现更精确的嵌入。