In the big data era, a key requirement for any algorithm is the ability to run efficiently in parallel in a distributed environment. The popular Silhouette metric for evaluating the quality of a clustering unfortunately lacks this property: its computational complexity is quadratic in the size of the input dataset. For this reason, its use has been hindered in big data scenarios, where clusterings had to be evaluated by other means. To fill this gap, in this paper we introduce the first algorithm that computes the Silhouette metric with linear complexity and can easily be executed in parallel in a distributed environment. Its implementation is freely available in the Apache Spark ML library.
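The abstract does not spell out how linear complexity is reached. As a minimal sketch of one way to do it, assume the squared Euclidean distance: the sum of squared distances from a point p to every point of a cluster C then depends only on three per-cluster aggregates (the point count N, the vector sum Y, and the sum of squared norms Ψ), since Σ ||p - q||² = N·||p||² + Ψ - 2·p·Y. The aggregates are plain sums, so they could be computed in a single distributed pass. The function names below are hypothetical and are not taken from the paper or from Spark; the quadratic version is included only as a cross-check.

```python
from collections import defaultdict


def silhouette_linear(points, labels):
    """Mean Silhouette under squared Euclidean distance.

    Cost is O(n * k * d) for n points, k clusters, d dimensions,
    instead of O(n^2 * d). Assumes at least two distinct clusters.
    """
    dim = len(points[0])
    # Per-cluster aggregates, built in one pass over the data.
    count = defaultdict(int)                    # N: number of points
    vec_sum = defaultdict(lambda: [0.0] * dim)  # Y: component-wise sum
    sq_norm_sum = defaultdict(float)            # Psi: sum of squared norms
    for p, c in zip(points, labels):
        count[c] += 1
        for i, v in enumerate(p):
            vec_sum[c][i] += v
        sq_norm_sum[c] += sum(v * v for v in p)

    def total_sq_dist(p, c):
        # Sum over q in cluster c of ||p - q||^2, from the aggregates only.
        p_sq = sum(v * v for v in p)
        dot = sum(v * y for v, y in zip(p, vec_sum[c]))
        return count[c] * p_sq + sq_norm_sum[c] - 2.0 * dot

    total = 0.0
    for p, c in zip(points, labels):
        if count[c] == 1:
            continue  # usual convention: a singleton cluster contributes 0
        # The sum over the own cluster includes p itself at distance 0,
        # so dividing by N - 1 gives the mean over the other members.
        a = total_sq_dist(p, c) / (count[c] - 1)
        b = min(total_sq_dist(p, o) / count[o] for o in count if o != c)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def silhouette_quadratic(points, labels):
    """Brute-force reference that compares every pair of points."""
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    clusters = set(labels)
    total = 0.0
    for i, (p, c) in enumerate(zip(points, labels)):
        own = [sq_dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == c and j != i]
        if not own:
            continue  # singleton cluster contributes 0
        a = sum(own) / len(own)
        b = float("inf")
        for o in clusters - {c}:
            other = [sq_dist(p, q) for q, l in zip(points, labels) if l == o]
            b = min(b, sum(other) / len(other))
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
    lbl = [0, 0, 0, 1, 1, 1]
    print(silhouette_linear(pts, lbl), silhouette_quadratic(pts, lbl))
```

Both functions should agree up to floating-point rounding. The identity above holds only for the squared Euclidean distance, so other distance measures would need different aggregates.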