We consider the problem of computing the k-means centers of a large high-dimensional dataset in the context of edge-based machine learning, where data sources offload machine learning computation to nearby edge servers. k-Means computation is fundamental to many data analytics tasks, and the capability of computing provably accurate k-means centers by leveraging the computation power of edge servers, at a low communication and computation cost to the data sources, would greatly improve the performance of these tasks. We propose to let the data sources send small summaries, generated by joint dimensionality reduction (DR) and cardinality reduction (CR), to support approximate k-means computation at reduced complexity and communication cost. By analyzing the complexity, the communication cost, and the approximation error of k-means algorithms based on state-of-the-art DR/CR methods, we show that: (i) a near-optimal approximation can be achieved at near-linear complexity and a constant or logarithmic communication cost, (ii) the order in which DR and CR are applied significantly affects both the complexity and the communication cost, and (iii) combining DR/CR methods with a properly configured quantizer can further reduce the communication cost without compromising the other performance metrics. Our findings are validated through experiments on real datasets.
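To make the proposed pipeline concrete, the following is a minimal sketch of a DR-then-CR summary for approximate k-means, not the paper's exact construction. It assumes scikit-learn is available, uses a Gaussian random projection for DR and a lightweight importance-sampling coreset for CR; the function name `summarize` and the parameters `target_dim` and `coreset_size` are illustrative, not the paper's notation.

```python
# Minimal sketch (assumed, not the paper's exact method): DR via Gaussian
# random projection, then CR via a lightweight coreset, then weighted
# k-means on the small summary at the edge server.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

def summarize(X, target_dim=20, coreset_size=500, seed=0):
    """Return a small weighted summary (S, w) of the dataset X."""
    rng = np.random.default_rng(seed)
    # DR: Johnson-Lindenstrauss-style random projection to target_dim dims.
    X_low = GaussianRandomProjection(
        n_components=target_dim, random_state=seed
    ).fit_transform(X)
    # CR: sample points with probability proportional to a uniform term
    # plus the squared distance to the data mean (a lightweight coreset),
    # attaching inverse-probability weights so clustering costs stay unbiased.
    dist2 = np.sum((X_low - X_low.mean(axis=0)) ** 2, axis=1)
    prob = 0.5 / len(X_low) + 0.5 * dist2 / dist2.sum()
    idx = rng.choice(len(X_low), size=coreset_size, replace=True, p=prob)
    weights = 1.0 / (coreset_size * prob[idx])
    return X_low[idx], weights

# The data source transmits only the (coreset_size x target_dim) summary
# and its weights; the edge server runs weighted k-means on the summary.
X = np.random.default_rng(1).normal(size=(100_000, 512))
S, w = summarize(X)
centers = KMeans(n_clusters=10, n_init=10).fit(S, sample_weight=w).cluster_centers_
```

Applying DR before CR, as in this sketch, also shrinks the cost of the sampling step itself, which illustrates the ordering effect noted in item (ii); the resulting centers live in the reduced space and induce a clustering of the original points.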