Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffer from synchronous barriers for each iteration and skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory architectures (NUMA) to maximizes independent, asynchronous computation. We eliminate many barriers, reduce remote memory accesses, and maximize cache reuse. We implement the 'Clustering NUMA Optimized Routines' (clusterNOR) extensible parallel framework that provides algorithmic building blocks. The system is generic, we demonstrate nine modern clustering algorithms that have simple implementations. clusterNOR includes (i) in-memory, (ii) semi-external memory, and (iii) distributed memory execution, enabling computation for varying memory and hardware budgets. For algorithms that rely on Euclidean distance, clusterNOR defines an updated Elkan's triangle inequality pruning algorithm that uses asymptotically less memory so that it works on billion-point data sets. clusterNOR extends and expands the scope of the 'knor' library for k-means clustering by generalizing underlying principles, providing a uniform programming interface and expanding the scope to hierarchical and linear algebraic classes of algorithms. The compound effect of our optimizations is an order of magnitude improvement in speed over other state-of-the-art solutions, such as Spark's MLlib and Apple's Turi.
翻译:集成算法是迭代的,具有复杂的数据存取模式,导致许多小型随机内存存存取。 平行执行的性能受到每个迭代和偏斜工作量的同步障碍的影响。 我们重新思考现代非统一内存结构(NUMA)组群的平行化, 以最大限度地实现独立和无同步的计算; 我们消除了许多屏障, 减少远程内存存存存存存取, 并最大限度地回收缓存。 我们实施了“ 封闭 NUMA 优化鲁丁斯( CroptNOR) ” (CroptNOR) 的平行框架, 以提供算法的构建。 系统是通用的, 我们展示了9个现代组群算法的算法, 并展示了有简单执行功能的系统。 集成( i) 内模组、 (ii) 半外部内存和(iii) 分布式内存执行。 对于依赖Euclideidean距离的算法, CrodNOR定义了更新的三角不平等调算算法, 以提供十亿点数据组。 的CNC- NCNONOR 扩大和Scal- glaslial- groupal rodual- grol- glasmalal roducal- 范围, 范围的系统, 范围为整个的系统, 范围为普通级级的系统, 范围为普通级的系统, 范围, 的系统级平流级平序级平基级平序号的系统。