Rejecting the null hypothesis in two-sample testing is a fundamental tool for scientific discovery. Yet, aside from concluding that two samples do not come from the same probability distribution, it is often of interest to characterize how the two distributions differ. Given samples from two densities $f_1$ and $f_0$, we consider the task of localizing occurrences of the inequality $f_1 > f_0$. To avoid the challenges associated with high-dimensional space, we propose a general hypothesis testing framework where hypotheses are formulated adaptively to the data by conditioning on the combined sample from the two densities. We then investigate a special case of this framework where the notion of locality is captured by a random walk on a weighted graph constructed over this combined sample. We derive a tractable testing procedure for this case employing a type of scan statistic, and provide non-asymptotic lower bounds on the power and accuracy of our test to detect whether $f_1>f_0$ in a local sense. Furthermore, we characterize the test's consistency according to a certain problem-hardness parameter, and show that our test achieves the minimax detection rate for this parameter. We conduct numerical experiments to validate our method, and demonstrate our approach on two real-world applications: detecting and localizing arsenic well contamination across the United States, and analyzing two-sample single-cell RNA sequencing data from melanoma patients.
翻译:拒绝两次抽样测试的无效假设是科学发现的基本工具。然而,除了得出两个样本并非来自同一概率分布的同一概率分布之外,我们往往有兴趣说明两种分布方式的不同。根据两个密度的样本,我们考虑将不平等发生地点本地化的任务 $f_1 > f_0美元。为了避免与高维空间相关的挑战,我们提议了一个一般性假设测试框架,通过调整两种密度的合并样本来适应数据。我们然后调查这一框架的一个特殊案例,在这个框架中,地点的概念是通过随机走动以构建的合并样本的加权图表来捕捉到的。我们用一种扫描统计,为这个案例制定一种可移植的测试程序,并且提供非非被动的低限我方测试,以便从本地角度检测是否为$_1>f_f_0美元。此外,我们根据某种困难度参数来描述测试的一致性。我们在这个框架中,我们测试地点概念的概念是通过在这个综合样本的加权图表上,通过随机行走来捕捉到一个地点的概念测试程序。我们用一种测试方法,通过一种精确度来测量我们的数据,从本地的测算方法,用两种测算方法来测算。