In scientific studies involving analyses of multivariate data, two questions often arise for the researcher. First, is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Second, are the features independent of one another, or can the features be grouped so that the groups are mutually independent? We propose a non-parametric approach that addresses these two questions. Our approach is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. In the exchangeability detection setting, through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our approach compares favorably in various scenarios of interest. We apply our method to problems in population and statistical genetics, including stratification detection and linkage disequilibrium splitting. We also consider other application domains, applying our approach to post-clustering single-cell chromatin accessibility data and World Values Survey data, where we show how users can partition features into independent groups, which helps generate new scientific hypotheses about the features.