Testing independence is of fundamental importance in modern data analysis, with broad applications in variable selection, graphical models, and causal inference. When the data are high-dimensional and the potential dependence signal is sparse, independence testing becomes very challenging without distributional or structural assumptions. In this paper we propose a general framework for independence testing: first fit a classifier that distinguishes the joint distribution from the product of the marginal distributions, then test the significance of the fitted classifier. This framework allows us to borrow the strength of the most advanced classification algorithms developed by the modern machine learning community, making it applicable to high-dimensional, complex data. By combining a sample split with a fixed permutation, our test statistic has a universal, fixed Gaussian null distribution that does not depend on the underlying data distribution. Extensive simulations demonstrate the advantages of the newly proposed test over existing methods. We further apply the new test to a single-cell data set to test the independence between two types of single-cell sequencing measurements, whose high dimensionality and sparsity make existing methods hard to apply.
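The recipe described above (sample split, fixed permutation, classifier, Gaussian reference distribution) can be illustrated with a minimal sketch. This is not the paper's exact statistic. The function names, the gradient-descent logistic regression used as a stand-in classifier, the cyclic shift used as the fixed permutation, and the lag-1 variance correction are all assumptions made for illustration, and any stronger classifier could be substituted.

```python
import math
import numpy as np


def pair_features(x, y):
    """Stack x, y and all cross products x_j * y_k as classifier inputs."""
    cross = (x[:, :, None] * y[:, None, :]).reshape(len(x), -1)
    return np.hstack([x, y, cross])


def fit_logistic(features, labels, steps=500, lr=0.1):
    """Gradient-descent logistic regression (hypothetical stand-in classifier)."""
    mu = features.mean(axis=0)
    sd = features.std(axis=0) + 1e-12
    z = (features - mu) / sd
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
        w -= lr * (z.T @ (p - labels)) / len(labels)
        b -= lr * np.mean(p - labels)
    # Return a function that maps raw features to a logit score.
    return lambda f: ((f - mu) / sd) @ w + b


def classifier_independence_test(x, y, seed=0):
    """Return (z statistic, one-sided p-value) for H0: x independent of y."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)

    # Sample split: one half fits the classifier, the other half tests it.
    idx = np.random.default_rng(seed).permutation(len(x))
    train, test = idx[: len(x) // 2], idx[len(x) // 2:]

    def joint_and_product(rows):
        xs, ys = x[rows], y[rows]
        joint = pair_features(xs, ys)                        # (x_i, y_i)
        product = pair_features(xs, np.roll(ys, 1, axis=0))  # (x_i, y_{i-1})
        return joint, product

    joint_tr, prod_tr = joint_and_product(train)
    labels = np.r_[np.ones(len(joint_tr)), np.zeros(len(prod_tr))]
    score = fit_logistic(np.vstack([joint_tr, prod_tr]), labels)

    # Under H0 the joint and shifted pairs score alike on average.
    joint_te, prod_te = joint_and_product(test)
    d = score(joint_te) - score(prod_te)
    m = len(d)
    c = d - d.mean()
    # Neighbouring differences share one y value, so the variance estimate
    # includes the (cyclic) lag-1 autocovariance.
    var = max(np.mean(c * c) + 2.0 * np.mean(c * np.roll(c, 1)), 1e-12)
    z_stat = math.sqrt(m) * d.mean() / math.sqrt(var)
    p_value = 0.5 * math.erfc(z_stat / math.sqrt(2.0))  # upper-tail N(0, 1)
    return float(z_stat), float(p_value)
```

A call such as `classifier_independence_test(x, y)` returns the standardized statistic and a one-sided p-value from the standard normal reference. Because the classifier is fitted on one half and evaluated on the other, the reference distribution in this sketch does not rely on the classifier being well specified; a poor classifier only costs power.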