Many testing problems are readily amenable to randomised tests, such as those employing data splitting, which divide the data into disjoint parts for separate purposes. However, despite their usefulness in principle, randomised tests have obvious drawbacks. Firstly, two analyses of the same dataset may lead to different results. Secondly, the test typically loses power because it does not fully utilise the entire sample. As a remedy to these drawbacks, we study how to combine the test statistics or p-values resulting from multiple random realisations, such as those obtained through random data splits. We introduce rank-transformed subsampling as a general method for delivering large-sample inference about the combined statistic or p-value under mild assumptions. We apply our methodology to a range of problems, including testing unimodality in high-dimensional data, testing goodness-of-fit of parametric quantile regression models, testing no direct effect in a sequentially randomised trial, and calibrating cross-fit double machine learning confidence intervals. For the last of these, our method improves coverage in finite samples; for the testing problems, it derandomises the test and improves power. Moreover, in contrast to existing p-value aggregation schemes, which can be highly conservative, our method enjoys a type-I error rate that asymptotically approaches the nominal level.
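To make the setting concrete, the following is a minimal sketch of a data-splitting test repeated over several random splits, with the resulting p-values combined by the classical "twice the median" rule. This rule is one example of the existing, potentially conservative aggregation schemes mentioned above; it is not the rank-transformed subsampling method. The toy problem (select the coordinate with the largest absolute mean on one half, then test it on the other half) and all function names are assumptions made purely for illustration.

```python
import math
import random
import statistics


def normal_two_sided_p(z):
    """Two-sided p-value of a standard normal statistic: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2))


def split_p_value(data, rng):
    """One randomised test (hypothetical toy example).

    Half A selects the coordinate with the largest absolute sample mean;
    half B tests whether that coordinate has mean zero with a z-test.
    """
    idx = list(range(len(data)))
    rng.shuffle(idx)
    half = len(idx) // 2
    part_a = [data[i] for i in idx[:half]]
    part_b = [data[i] for i in idx[half:]]
    dim = len(data[0])

    # Selection step, using half A only.
    means_a = [statistics.fmean(row[j] for row in part_a) for j in range(dim)]
    j_star = max(range(dim), key=lambda j: abs(means_a[j]))

    # Testing step, using half B only.
    col = [row[j_star] for row in part_b]
    z = statistics.fmean(col) / (statistics.stdev(col) / math.sqrt(len(col)))
    return normal_two_sided_p(z)


def twice_median(p_values):
    """Classical aggregation rule: valid under arbitrary dependence, but conservative."""
    return min(1.0, 2.0 * statistics.median(p_values))


# Simulated null data (assumption: 200 observations, 5 independent N(0, 1) coordinates).
rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]

# Each random split gives a different p-value for the same dataset.
p_values = [split_p_value(data, rng) for _ in range(50)]
combined = twice_median(p_values)

print("smallest and largest single-split p-values:", min(p_values), max(p_values))
print("twice-the-median combined p-value:", combined)
```

The spread between the smallest and largest single-split p-values illustrates the first drawback, and the factor of two in the combination rule is the price paid for validity without any knowledge of the dependence between splits.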