For high-dimensional data, where P features for N objects (P >> N) are represented in an N × P matrix X, we describe a clustering algorithm based on the normalized left Gram matrix, G = XX'/P. Under certain regularity conditions, the rows of G that correspond to objects in the same cluster converge to the same mean vector. By clustering on these row means, the algorithm requires neither preprocessing by dimension reduction or feature selection nor specification of tuning parameters or hyperparameters. Because it operates on the N × N matrix G, it has a lower computational cost than many methods that cluster the feature matrix X directly. In a comparison with 14 other clustering algorithms on 32 benchmark microarray datasets, the proposed algorithm provided the most accurate estimate of the underlying cluster configuration more than twice as often as its closest competitors.
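As a rough illustration of the central quantity in the abstract, the sketch below computes G = XX'/P with NumPy and checks, on synthetic data, that rows of G belonging to the same cluster lie closer together than rows from different clusters. It is not the authors' clustering procedure: the function name `normalized_gram`, the two-cluster Gaussian data, and the row-distance check are all assumptions made for illustration.

```python
import numpy as np


def normalized_gram(X):
    """Return the normalized left Gram matrix G = X X' / P for an N x P matrix X."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    return X @ X.T / n_features


# Hypothetical synthetic data: two clusters of 5 objects each, P >> N.
rng = np.random.default_rng(0)
P = 5_000
per_cluster = 5
mu_a = rng.normal(size=P)  # assumed mean profile of cluster A
mu_b = rng.normal(size=P)  # assumed mean profile of cluster B
X = np.vstack([
    mu_a + rng.normal(size=(per_cluster, P)),
    mu_b + rng.normal(size=(per_cluster, P)),
])
labels = np.repeat([0, 1], per_cluster)

G = normalized_gram(X)  # N x N, here 10 x 10

# Euclidean distances between every pair of rows of G.
diff = G[:, None, :] - G[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

same_cluster = labels[:, None] == labels[None, :]
off_diagonal = ~np.eye(len(labels), dtype=bool)

max_within = D[same_cluster & off_diagonal].max()
min_between = D[~same_cluster].min()
print("largest within-cluster row distance:", max_within)
print("smallest between-cluster row distance:", min_between)
```

Because G is only N × N, every step after the initial matrix product works on a small matrix when P >> N, which is the source of the computational saving the abstract mentions.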