Many statistical analyses assume that the data points within a sample are exchangeable and their features have some known dependency structure. Given a feature dependency structure, one can ask if the observations are exchangeable, in which case we say that they are homogeneous. Homogeneity may be the end goal of a clustering algorithm or a justification for not clustering. Apart from random matrix theory approaches, few general approaches provide statistical guarantees of exchangeability or homogeneity without labeled examples from distinct clusters. We propose a fast and flexible non-parametric hypothesis testing approach that takes as input a multivariate individual-by-feature dataset and user-specified feature dependency constraints, without labeled examples, and reports whether the individuals are exchangeable at a user-specified significance level. Our approach controls Type I error across realistic scenarios and handles data of arbitrary dimension. We perform an extensive simulation study to evaluate the efficacy of domain-agnostic tests of stratification, and find that our approach compares favorably in various scenarios of interest. Finally, we apply our approach to post-clustering single-cell chromatin accessibility data and World Values Survey data, and show how it helps to identify drivers of heterogeneity and generate clusters of exchangeable individuals.
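To make the testing idea concrete, here is a minimal sketch of a permutation test for exchangeability of individuals. It is not the authors' implementation. It assumes the simplest dependency structure (all features mutually independent), and the function name `exchangeability_pvalue`, the choice of statistic (variance of pairwise squared Euclidean distances between individuals), and the parameter `n_perm` are illustrative assumptions. Under the null hypothesis with independent features, shuffling each feature column separately across individuals leaves the joint distribution unchanged, so the shuffled datasets provide a reference distribution for the statistic.

```python
import numpy as np


def exchangeability_pvalue(X, n_perm=500, seed=0):
    """Permutation p-value for H0: the rows (individuals) of X are exchangeable.

    Illustrative sketch only: assumes all columns (features) are mutually
    independent, so each column may be shuffled separately under H0.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    upper = np.triu_indices(n, k=1)  # index each unordered pair of rows once

    def statistic(M):
        # Variance of pairwise squared Euclidean distances between rows.
        # Hidden strata make some pairs close and others far apart,
        # which inflates this variance.
        sq_norms = (M ** 2).sum(axis=1)
        dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (M @ M.T)
        return dist[upper].var()

    observed = statistic(X)
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffle every feature independently across individuals.
        shuffled = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(p)]
        )
        null_stats[b] = statistic(shuffled)

    # Add-one correction keeps the p-value strictly positive and valid.
    return (1 + np.sum(null_stats >= observed)) / (1 + n_perm)
```

One would reject exchangeability when the returned p-value falls below the chosen significance level. Handling dependent features would require permuting groups of features jointly rather than one column at a time, which this sketch does not attempt.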