Clustering is an unsupervised machine learning method that groups data samples into clusters of similar objects. In practice, clustering has been used in numerous applications, such as bank customer profiling, document retrieval, image segmentation, and e-commerce recommendation engines. However, existing clustering techniques have significant limitations, one of which is that the stability of their results depends on the initialization parameters (e.g., the number of clusters and the initial centroids). Different solutions have been presented in the literature to overcome this limitation, namely internal and external validation metrics. However, these solutions incur high computational complexity and memory consumption, especially when dealing with big data. In this paper, we apply a recent object detection Deep Learning (DL) model, YOLO-v5, to detect the initial clustering parameters, such as the number of clusters together with their sizes and centroids. The proposed solution adds a DL-based initialization phase that frees the clustering algorithms from manual initialization. Two models are provided in this work, one for isolated clusters and the other for overlapping clusters. The features of the incoming dataset determine which model to use. The results show that the proposed solution can provide near-optimal cluster initialization parameters with low computational and resource overhead compared to existing solutions.
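To make the idea of a detection-based initialization phase concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that a detector has already returned one bounding box per cluster, expressed in the same coordinates as the data as `(x_min, y_min, x_max, y_max)`; this box format, the helper names `boxes_to_init` and `kmeans`, and the choice of plain Lloyd-style k-means as the downstream algorithm are all illustrative assumptions. The sketch covers only the isolated-cluster case.

```python
from math import dist


def boxes_to_init(boxes):
    """Derive (k, centroids) from hypothetical detector boxes.

    Each box is assumed to be (x_min, y_min, x_max, y_max) in data coordinates.
    The box center is used as the initial centroid of the corresponding cluster.
    """
    centroids = [((x0 + x1) / 2.0, (y0 + y1) / 2.0) for x0, y0, x1, y1 in boxes]
    return len(centroids), centroids


def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations seeded with the supplied centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: mean of each cluster; an empty cluster keeps its centroid.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Toy data: two well-separated groups and two assumed detection boxes.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
boxes = [(-0.5, -0.5, 0.5, 0.5), (4.5, 4.5, 5.5, 5.5)]

k, init_centroids = boxes_to_init(boxes)
labels, centers = kmeans(points, init_centroids)
```

In this sketch the number of boxes fixes the number of clusters and the box centers seed the centroids, so the clustering step itself needs no user-supplied initialization. How detections are mapped back to data space, and how overlapping clusters are handled, is not specified here and would need to follow the method described in the paper.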