Clustering algorithms are among the main analytical methods for detecting patterns in unlabeled data. Existing clustering methods typically treat the samples in a dataset as points in a metric space and compute distances to group similar points together. In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data: training neural networks to perform instance segmentation on plotted data. Our approach, Visual Clustering, has several advantages over traditional clustering algorithms: it is much faster than most existing clustering algorithms (making it suitable for very large datasets), it agrees strongly with human intuition about clusters, and it is hyperparameter-free by default (although additional steps with hyperparameters can be introduced for more control over the algorithm). We describe the method and compare it to ten other clustering methods on synthetic data to illustrate its advantages and disadvantages. We then demonstrate how our approach can be extended to higher-dimensional data and illustrate its performance on real-world data. The implementation of Visual Clustering is publicly available and can be applied to any dataset in a few lines of code.
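The pipeline described above (plot the points, segment the plot, read each point's cluster off the segmentation mask) can be illustrated with a minimal sketch. This is not the released implementation: the trained instance-segmentation network is replaced here by plain connected-component labelling so that the example runs without any model, and the function names and the grid size are hypothetical choices made only for illustration.

```python
from collections import deque


def rasterize(points, size=64):
    """Plot 2D points on a size x size binary grid.

    Returns the grid and, for each point, the (row, col) pixel it landed on.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0  # avoid division by zero
    y_span = (max(ys) - y_min) or 1.0
    grid = [[0] * size for _ in range(size)]
    pixels = []
    for x, y in points:
        col = min(size - 1, int((x - x_min) / x_span * size))
        row = min(size - 1, int((y - y_min) / y_span * size))
        grid[row][col] = 1
        pixels.append((row, col))
    return grid, pixels


def segment_stand_in(grid):
    """Stand-in segmenter (assumption): 8-connected component labelling.

    The paper uses a trained neural network for this step; this function
    only mimics its output format, an integer instance mask.
    """
    size = len(grid)
    mask = [[0] * size for _ in range(size)]
    next_label = 0
    for r in range(size):
        for c in range(size):
            if grid[r][c] == 0 or mask[r][c] != 0:
                continue
            next_label += 1
            mask[r][c] = next_label
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < size and 0 <= nc < size
                                and grid[nr][nc] and not mask[nr][nc]):
                            mask[nr][nc] = next_label
                            queue.append((nr, nc))
    return mask


def visual_cluster_sketch(points, size=64):
    """Assign each point the instance label of the pixel it was plotted on."""
    grid, pixels = rasterize(points, size)
    mask = segment_stand_in(grid)
    return [mask[r][c] for r, c in pixels]
```

Calling `visual_cluster_sketch(points)` on a list of `(x, y)` pairs returns one integer label per point. Because every point is mapped to a pixel once and the segmentation runs on a fixed-size image, the cost of the segmentation step does not grow with the number of points, which is one plausible reading of why the approach scales to very large datasets.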