Clustering is a fundamental problem in machine learning, and distance-based approaches have dominated the field for many decades. The problem is typically tackled by partitioning the data into K clusters, where the number of clusters K is chosen a priori. While significant progress has been made along these lines over the years, it is well established that as the number of clusters or dimensions increases, current approaches get stuck in local minima, resulting in suboptimal solutions. In this work, we propose a new set of distance-threshold methods called Theta-based Algorithms (ThetA). Through experimental comparisons and complexity analyses, we show that our proposed approach outperforms existing approaches in (a) clustering accuracy and (b) time complexity. Additionally, we show that for a large class of problems, learning the optimal threshold is straightforward compared with learning K. Moreover, we show how ThetA can infer the sparsity of datasets in higher dimensions.
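To make the contrast between fixing K and fixing a distance threshold concrete, the following is a minimal sketch of a generic leader-style threshold clustering. It is not the ThetA algorithm, whose details the abstract does not give; the function name `threshold_cluster`, the parameter `theta`, and the running-mean centroid update are all assumptions made for illustration.

```python
import math


def threshold_cluster(points, theta):
    """Generic leader-style clustering (illustrative assumption, not ThetA).

    Each point joins the nearest existing centroid if it lies within
    distance `theta`; otherwise it starts a new cluster. The number of
    clusters is therefore an output rather than an input.
    """
    centroids = []  # current centroid of each cluster
    counts = []     # number of points assigned to each cluster
    labels = []     # cluster index for each input point

    for p in points:
        best, best_d = None, theta
        for i, c in enumerate(centroids):
            d = math.dist(p, c)  # Euclidean distance (Python 3.8+)
            if d <= best_d:
                best, best_d = i, d

        if best is None:
            # No centroid within theta: open a new cluster at this point.
            centroids.append(list(p))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the chosen centroid as a running mean of its members.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], p)]
            labels.append(best)

    return labels, centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
    labels, centroids = threshold_cluster(data, theta=1.0)
    print(labels, len(centroids))
```

In this sketch the result depends on the order in which points are visited, a known property of single-pass leader clustering; whether ThetA shares that property is not stated in the abstract.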