Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods that assess statistical significance have recently drawn attention because of their importance for identifying patterns in high-dimensional data, with applications in many scientific fields. We present a U-statistics based approach, specifically tailored to high-dimensional data, that clusters the data into three groups while assessing the significance of such partitions. Because our approach builds on the U-statistics based clustering framework of the methods in the R package uclust, it inherits that framework's characteristics: it is a non-parametric method that relies on very few assumptions about the data, and it can therefore be applied to a wide range of datasets. Furthermore, our method aims to be a more powerful tool for finding the best partition of the data into three groups when that particular structure is present. To do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. We then propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and is found to have more statistical power than competing alternatives in all scenarios considered. Applications to peripheral blood mononuclear cells and to image recognition show the versatility of our proposal, which performs better than the other approaches compared.