AnchorGAE: General Data Clustering via $O(n)$ Bipartite Graph Convolution

From arXiv. Copyright 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

The representational capacity of graph-based clustering methods is usually limited by the graph constructed on the original features, so it is attractive to ask whether graph neural networks (GNNs) can be used to augment that capacity. The core problems come from two aspects: (1) a graph is unavailable in most clustering scenarios, so constructing a high-quality graph on non-graph data is usually the most important part; (2) given $n$ samples, graph-based clustering methods usually consume at least $\mathcal O(n^2)$ time to build graphs, and graph convolution requires nearly $\mathcal O(n^2)$ for a dense graph and $\mathcal O(|\mathcal{E}|)$ for a sparse one with $|\mathcal{E}|$ edges. Accordingly, both graph-based clustering and GNNs suffer from severe inefficiency. To tackle these problems, we propose a novel clustering method, AnchorGAE, with self-supervised estimation of the graph and efficient graph convolution. We first show how to convert a non-graph dataset into a graph dataset by introducing a generative graph model and anchors. We then show that the constructed bipartite graph reduces the computational complexity of graph convolution from $\mathcal O(n^2)$ and $\mathcal O(|\mathcal{E}|)$ to $\mathcal O(n)$. The succeeding clustering steps can easily be designed as $\mathcal O(n)$ operations. Interestingly, the anchors naturally lead to a siamese architecture with the help of a Markov process. Furthermore, the estimated bipartite graph is updated dynamically according to the features extracted by the GNN, to improve the quality of the graph. However, we theoretically prove that the self-supervised paradigm frequently results in a collapse, which in experiments often occurs after 2-3 update iterations, especially when the model is well trained. A specific strategy is accordingly designed to prevent the collapse.
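To make the complexity claim concrete, the following is a minimal sketch of why a sample-to-anchor bipartite graph allows graph convolution in time linear in $n$. The abstract does not state the exact propagation rule, so the sketch assumes the common anchor-graph form $A = B \Delta^{-1} B^\top$, where $B$ is the $n \times m$ sample-to-anchor matrix and $\Delta$ holds the anchor degrees. The function names, the choice of $k$ nearest anchors, and the Gaussian-style weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_bipartite_graph(X, anchors, k=3):
    """Connect each sample to its k nearest anchors; every row of B sums to 1."""
    # Squared Euclidean distances between n samples and m anchors, shape (n, m).
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    n, m = d2.shape
    # Indices of the k closest anchors per sample (unordered within the k).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    d_near = d2[rows, nearest]
    # Illustrative weighting (an assumption): Gaussian-style kernel, shifted by
    # the row minimum so that at least one weight per row equals 1.
    w = np.exp(-(d_near - d_near.min(axis=1, keepdims=True)))
    B = np.zeros((n, m))
    B[rows, nearest] = w / w.sum(axis=1, keepdims=True)
    return B

def anchor_graph_conv(B, X, W):
    """Compute (B diag(delta)^-1 B^T) X W without forming the n x n matrix."""
    delta = B.sum(axis=0)                      # anchor degrees, shape (m,)
    delta = np.where(delta > 0, delta, 1.0)    # guard against unused anchors
    anchor_feats = (B.T @ X) / delta[:, None]  # samples -> anchors, shape (m, d)
    return (B @ anchor_feats) @ W              # anchors -> samples, shape (n, d_out)
```

With $m$ anchors and feature dimension $d$, both matrix products cost $\mathcal O(nmd)$ when $B$ is stored densely as above, which is linear in $n$ for fixed $m$ and $d$; the dense product $A X$ would instead cost $\mathcal O(n^2 d)$. A sparse representation of $B$ would reduce the cost further to $\mathcal O(nkd)$.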
