An optimal data partitioning in parallel & distributed implementation of clustering algorithms is a necessary computation as it ensures independent task completion, fair distribution, less number of affected points and better & faster merging. Though partitioning using Kd Tree is being conventionally used in academia, it suffers from performance drenches and bias (non equal distribution) as dimensionality of data increases and hence is not suitable for practical use in industry where dimensionality can be of order of 100s to 1000s. To address these issues we propose two new partitioning techniques using existing mathematical models & study their feasibility, performance (bias and partitioning speed) & possible variants in choosing initial seeds. First method uses an n dimensional hashed grid based approach which is based on mapping the points in space to a set of cubes which hashes the points. Second method uses a tree of voronoi planes where each plane corresponds to a partition. We found that grid based approach was computationally impractical, while using a tree of voronoi planes (using scalable K-Means++ initial seeds) drastically outperformed the Kd-tree tree method as dimensionality increased.
翻译:在平行和分布地实施集群算法时,最佳的数据分割是必需的计算,因为它能确保独立的任务完成、公平分配、减少受影响的点数以及更好和更快的合并。虽然使用Kd树的分割正在学术界传统地使用,但随着数据维度的提高,它会受到性能偏差和偏差(不平均分布)的影响,因此不适合在多元性可以达到100到1000秒的工业中实际使用。为了解决这些问题,我们提议采用两种新的分割技术,利用现有数学模型并研究其可行性、性能(坡度和偏移速度)和选择初始种子的可能变异物。第一种方法是使用基于空间点绘图的正维面散状网格方法,以具有点数的立方块为基础。第二种方法是使用每平面与分区相匹配的体形平面树。我们发现,基于网格的方法在计算上不切实际,同时使用螺旋形平面的树(使用可伸缩的K-Means+初始种子)大大超过Kd树方法。