The two-sample problem, which consists in testing whether independent samples on $\mathbb{R}^d$ are drawn from the same (unknown) distribution, finds applications in many areas. Its study in high-dimension is the subject of much attention, especially because the information acquisition processes at work in the Big Data era often involve various sources, poorly controlled, leading to datasets possibly exhibiting a strong sampling bias. While classic methods relying on the computation of a discrepancy measure between the empirical distributions face the curse of dimensionality, we develop an alternative approach based on statistical learning and extending rank tests, capable of detecting small departures from the null assumption in the univariate case when appropriately designed. Overcoming the lack of natural order on $\mathbb{R}^d$ when $d\geq 2$, it is implemented in two steps. Assigning to each of the samples a label (positive vs. negative) and dividing them into two parts, a preorder on $\mathbb{R}^d$ defined by a real-valued scoring function is learned by means of a bipartite ranking algorithm applied to the first part and a rank test is applied next to the scores of the remaining observations to detect possible differences in distribution. Because it learns how to project the data onto the real line nearly like (any monotone transform of) the likelihood ratio between the original multivariate distributions would do, the approach is not much affected by the dimensionality, ignoring ranking model bias issues, and preserves the advantages of univariate rank tests. Nonasymptotic error bounds are proved based on recent concentration results for two-sample linear rank-processes and an experimental study shows that the approach promoted surpasses alternative methods standing as natural competitors.
翻译:两层问题, 包括测试 $mathbb{R ⁇ d$ 的独立样本是否来自同一( 未知的) 分布, 是否在很多领域找到应用。 它的高级研究是人们非常关注的话题, 特别是因为大数据时代工作的信息获取过程往往涉及各种来源, 控制不力, 导致数据集可能表现出强烈的抽样偏差。 虽然根据经验分布间差异度的计算方法, 面临维度的诅咒, 我们根据统计学习和扩展等级测试, 能够发现与单体( 未知的) 配置的全体假设的小偏差。 在设计适当时, 它在高维度的情况下, 它的研究是高维度的。 克服美元\ mathb{R ⁇ d$ 的自然顺序缺乏自然秩序, 它在两个样本中指定一个标签( 阳性对正比值的), 并且将它们分为两个部分, 以正值的平面的平面评分数函数 。 使用双面的正值的直值排序, 将原始的正值排序推算到直径直方的排序 。