Model selection is a central task in modern statistics and machine learning. However, most existing methods do not apply to heavy-tailed data, which are commonly encountered in real applications such as single-cell multiomics. In this paper, we propose a rank-sum based approach that outputs a confidence set containing the optimal model with guaranteed probability. Motivated by conformal inference, we develop a general method that requires no moment or tail assumptions on the data. We demonstrate the advantage of the proposed method through extensive simulations and a real application to the COVID-19 genomics dataset (Stephenson et al., 2021). To perform inference on rank-sum statistics, we derive a general Gaussian approximation theory for high-dimensional two-sample U-statistics, which may be of independent interest to the statistics and machine learning community.