High-dimensional data, where the dimension of the feature space is much larger than the sample size, arise in a number of statistical applications. In this context, we construct the generalized multivariate sign transformation, defined as a vector divided by its norm. For different choices of the norm function, the resulting transformed vector adapts to certain geometrical features of the data distribution. Building on this idea, we obtain one-sample and two-sample testing procedures for mean vectors of high-dimensional data using these generalized sign vectors. These tests are based on U-statistics using kernel inner products, do not require restrictive assumptions, and are amenable to a fast randomization-based implementation. Through experiments in a number of data settings, we show that tests using generalized signs display higher power than existing tests, while maintaining nominal type-I error rates. Finally, we provide example applications on the MNIST and Minnesota Twin Studies genomic data.
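As a rough illustration of the ideas above, the following Python sketch computes generalized signs (each observation divided by its norm), forms a U-statistic from pairwise inner products of the sign vectors, and calibrates a one-sample test by randomization. It is a minimal sketch under stated assumptions, not the procedure of the paper: the plain linear inner product stands in for the kernel, random sign flips (valid under a symmetry assumption on the null distribution) stand in for the randomization scheme, and the function names `generalized_sign`, `u_statistic`, and `one_sample_sign_test` are hypothetical.

```python
import numpy as np


def generalized_sign(X, norm_ord=2):
    """Divide each row of X by its norm; all-zero rows are left as zeros."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, ord=norm_ord, axis=1, keepdims=True)
    safe = np.where(norms > 0, norms, 1.0)  # avoid division by zero
    return X / safe


def u_statistic(S):
    """Average of the inner products S_i . S_j over all ordered pairs i != j."""
    n = S.shape[0]
    G = S @ S.T
    return (G.sum() - np.trace(G)) / (n * (n - 1))


def one_sample_sign_test(X, mu0=None, norm_ord=2, n_rand=500, seed=0):
    """Test H0: mean = mu0 (default: zero vector) using generalized signs.

    Assumption (not from the source): the null distribution is obtained by
    random sign flips of the sign vectors, which presumes symmetry about mu0.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    if mu0 is not None:
        X = X - np.asarray(mu0, dtype=float)

    S = generalized_sign(X, norm_ord)
    n = S.shape[0]

    # Gram matrix of sign vectors with the diagonal (i == j terms) removed.
    G = S @ S.T
    np.fill_diagonal(G, 0.0)
    observed = G.sum() / (n * (n - 1))

    # Each row of E is one random sign pattern; flipping S_i by e_i turns
    # S_i . S_j into e_i * e_j * (S_i . S_j), so only G is needed.
    E = rng.choice([-1.0, 1.0], size=(n_rand, n))
    null = ((E @ G) * E).sum(axis=1) / (n * (n - 1))

    # One-sided randomization p-value with the usual +1 correction.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_rand)
    return observed, p_value
```

Because the randomization step reuses the n-by-n Gram matrix instead of recomputing inner products in the full feature dimension, its cost does not grow with the number of features. Passing a different `norm_ord` (for example `1` or `np.inf`) changes which norm defines the sign vector.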