In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test and to test these hypotheses; that is, for both exploratory and confirmatory data analysis. Reusing the same dataset for both exploration and testing can lead to massive selection bias and, in turn, to many false discoveries. Selective inference is a framework that allows for valid inference even when the same data is reused for exploration and testing. In this work, we are interested in the problem of selective inference for data clustering, where a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a framework for selective inference in this setting when hierarchical clustering is used to produce the cluster assignments; this framework was then extended to k-means clustering by Chen and Witten [2022]. Both of these works rely on assuming a known covariance structure for the data, but in practice the noise level needs to be estimated, and this is particularly challenging when the true cluster structure is unknown. In our work, we extend to the setting of noise with unknown variance and provide a selective inference method for this more general setting. Empirical results show that our new method is better able to maintain high power while controlling Type I error when the true noise level is unknown.
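The selection bias described above can be illustrated with a small simulation. The sketch below is not the method of this work, nor that of Gao et al. or Chen and Witten. It only shows the problem that selective inference is meant to correct. It draws one-dimensional data from a single Gaussian (so there are no true clusters), splits it with a simple 2-means procedure, and then applies a naive two-sample z-test to the resulting groups. The function names (`two_means_1d`, `naive_p_value`, `naive_rejection_rate`) and the assumption of a known noise level `sigma = 1` are illustrative choices, not taken from the source.

```python
import math
import random


def two_means_1d(xs, iters=50):
    """Lloyd's algorithm with k=2 on 1-D data; returns a list of 0/1 labels."""
    c0, c1 = min(xs), max(xs)  # deterministic initialisation at the extremes
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign each point to the nearer of the two centres.
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        if not g0 or not g1:
            break
        new_c0, new_c1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if (new_c0, new_c1) == (c0, c1):
            break  # converged
        c0, c1 = new_c0, new_c1
    return labels


def naive_p_value(xs, labels, sigma=1.0):
    """Two-sided z-test for equal group means that ignores how the groups were chosen.

    Assumes a known noise standard deviation `sigma` (an illustrative simplification).
    """
    g0 = [x for x, lab in zip(xs, labels) if lab == 0]
    g1 = [x for x, lab in zip(xs, labels) if lab == 1]
    diff = sum(g0) / len(g0) - sum(g1) / len(g1)
    se = sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    z = diff / se
    # 2 * (1 - Phi(|z|)) written via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


def naive_rejection_rate(n_sims=200, n=100, alpha=0.05, seed=0):
    """Fraction of null datasets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one Gaussian: no real clusters
        labels = two_means_1d(xs)
        if naive_p_value(xs, labels) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print("naive rejection rate under the null:", naive_rejection_rate())
```

Because the groups are chosen to be as separated as possible, the naive test rejects far more often than the nominal 5% level even though the null hypothesis holds in every simulated dataset. A selective inference procedure instead conditions on the clustering event so that the reported p-value remains valid.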