In this paper, we test whether two datasets share a common clustering structure. As a leading example, we focus on comparing clustering structures in two independent random samples from two mixtures of multivariate normal distributions. The mean parameters of these normal distributions are treated as potentially unknown nuisance parameters and are allowed to differ between the two mixtures. Assuming the mean parameters are known, we first determine the phase diagram of the testing problem over the entire range of signal-to-noise ratios by providing both lower bounds and tests that achieve them. When the nuisance parameters are unknown, we propose tests that achieve the detection boundary adaptively as long as the ambient dimensions of the datasets grow at a sub-linear rate with the sample size.