We present a general framework for hypothesis testing on distributions of sets of individual examples. Sets may represent many common data sources, such as groups of observations in time series, collections of words in text, or batches of images of a given phenomenon. This observation pattern, however, violates the assumptions commonly required for hypothesis testing: sets differ in size, may have differing levels of noise, and may incorporate nuisance variability that is irrelevant to the phenomenon of interest. All of these features bias test decisions if not accounted for. In this paper, we propose to interpret sets as independent samples from a collection of latent probability distributions, and we introduce kernel two-sample and independence tests in this latent space of distributions. We prove the consistency of the tests and observe that they outperform alternative approaches in a wide range of synthetic experiments. Finally, we showcase their use in practice with experiments on healthcare and climate data, where heuristics were previously needed for feature extraction and testing.
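To make the idea of testing in a latent space of distributions concrete, the following is a minimal sketch and not the authors' implementation. It assumes each set is summarized by its empirical kernel mean embedding under a Gaussian base kernel, uses the inner product of those embeddings as the kernel between sets, and calibrates a biased MMD statistic with a permutation test. The function names, the bandwidth parameter `gamma`, and the permutation calibration are all illustrative assumptions; the paper's actual estimator and null approximation may differ.

```python
import numpy as np

def set_kernel(A, B, gamma=1.0):
    """Inner product of the empirical mean embeddings of two sets.

    A and B have shapes (n_a, d) and (n_b, d); set sizes may differ.
    The Gaussian base kernel and gamma are assumptions of this sketch.
    """
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists).mean()

def gram_matrix(sets, gamma=1.0):
    """Symmetric Gram matrix of set_kernel over a list of sets."""
    m = len(sets)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            K[i, j] = K[j, i] = set_kernel(sets[i], sets[j], gamma)
    return K

def mmd2(K, idx_x, idx_y):
    """Biased (V-statistic) estimate of squared MMD from a Gram matrix."""
    k_xx = K[np.ix_(idx_x, idx_x)].mean()
    k_yy = K[np.ix_(idx_y, idx_y)].mean()
    k_xy = K[np.ix_(idx_x, idx_y)].mean()
    return k_xx + k_yy - 2.0 * k_xy

def two_sample_test(sets_x, sets_y, n_perm=200, gamma=1.0, seed=0):
    """Permutation two-sample test between two collections of sets."""
    rng = np.random.default_rng(seed)
    K = gram_matrix(list(sets_x) + list(sets_y), gamma)
    n_x = len(sets_x)
    n = n_x + len(sets_y)
    idx = np.arange(n)
    stat = mmd2(K, idx[:n_x], idx[n_x:])
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(n)  # reshuffle which sets belong to which group
        null[b] = mmd2(K, perm[:n_x], perm[n_x:])
    p_value = (1 + np.sum(null >= stat)) / (1 + n_perm)
    return stat, p_value

def make_sets(rng, mean, n_sets, dim=2):
    """Synthetic sets of varying size drawn from a Gaussian (illustrative data)."""
    return [rng.normal(mean, 1.0, size=(int(rng.integers(5, 30)), dim))
            for _ in range(n_sets)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    stat, p_value = two_sample_test(make_sets(rng, 0.0, 15), make_sets(rng, 1.0, 15))
    print(f"MMD^2 = {stat:.4f}, p-value = {p_value:.4f}")
```

Because each set is averaged into a single embedding, sets of different sizes can be compared directly. This sketch does not correct for the extra noise carried by small sets, which is one of the issues the abstract identifies as a source of bias.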