Given $M$ distributions ($M \geq 2$), defined on a general measurable space, we introduce a nonparametric (kernel) measure of multi-sample dissimilarity (KMD) -- a parameter that quantifies the difference between the $M$ distributions. The population KMD, which takes values between 0 and 1, is 0 if and only if all the $M$ distributions are the same, and 1 if and only if all the distributions are mutually singular. Moreover, KMD possesses many properties commonly associated with $f$-divergences, such as the data processing inequality and invariance under bijective transformations. The sample estimate of KMD, based on independent observations from the $M$ distributions, can be computed in near-linear time (up to logarithmic factors) using $k$-nearest neighbor graphs (for fixed $k \geq 1$). We develop an easily implementable test for the equality of the $M$ distributions based on the sample KMD that is consistent against all alternatives where at least two distributions are not equal. We prove central limit theorems for the sample KMD, and provide a complete characterization of the asymptotic power of the test, as well as its detection threshold. The usefulness of our measure is demonstrated via real and synthetic data examples; our method is also implemented in an R package.