Clustering is an unsupervised analysis method that groups samples into homogeneous, well-separated subgroups of observations, called clusters. To interpret the clusters, statistical hypothesis testing is often used to identify the variables that significantly separate the estimated clusters from one another. However, the hypotheses tested in this way are data-driven, since they are derived from the clustering results themselves. Because of this double use of the data, traditional hypothesis tests fail to control the Type I error rate, in particular because of the uncertainty in the clustering process and the artificial differences it can create. We propose three novel statistical hypothesis tests that account for the clustering process. Our tests efficiently control the Type I error rate by identifying only variables that contain a true signal separating groups of observations.
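The double use of the data described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not one of the three proposed tests: it draws a single Gaussian variable with no true cluster structure, splits it with a simple one-dimensional 2-means procedure, and then applies a naive two-sample test to the resulting groups. The helper names (`two_means_1d`, `welch_p_value`, `rejection_rate`), the sample sizes, and the normal approximation to Welch's t-test are all illustrative choices.

```python
import random
from statistics import NormalDist, mean, variance


def two_means_1d(x, n_iter=20):
    """Lloyd's algorithm with k=2 on one variable; returns a 0/1 label per sample."""
    c0, c1 = min(x), max(x)  # deterministic initial centers (illustrative choice)
    labels = [0] * len(x)
    for _ in range(n_iter):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        g0 = [v for v, lab in zip(x, labels) if lab == 0]
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        c0, c1 = mean(g0), mean(g1)
    return labels


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means.

    Assumption: uses a normal approximation to Welch's t statistic,
    which is reasonable only for moderately large groups.
    """
    if len(a) < 2 or len(b) < 2:
        return 1.0  # not enough data to estimate a variance
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(split, n_sim=200, n=100, alpha=0.05, seed=0):
    """Fraction of null datasets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(0, 1) for _ in range(n)]  # null data: one homogeneous group
        labels = split(x, rng)
        a = [v for v, lab in zip(x, labels) if lab == 0]
        b = [v for v, lab in zip(x, labels) if lab == 1]
        rejections += welch_p_value(a, b) < alpha
    return rejections / n_sim


# Groups estimated from the same data that is then tested (double use of the data).
naive_rate = rejection_rate(lambda x, rng: two_means_1d(x))
# Groups assigned at random, independently of the data (valid reference case).
random_rate = rejection_rate(lambda x, rng: [rng.randrange(2) for _ in x])

print(f"rejection rate after clustering:   {naive_rate:.2f}")
print(f"rejection rate after random split: {random_rate:.2f}")
```

In this sketch the naive test rejects on almost every null dataset when the groups come from clustering, whereas the rejection rate stays near the nominal level when the groups are assigned independently of the data. The gap between the two rates is the Type I error inflation that a test accounting for the clustering step needs to remove.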