We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis.
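As an illustration of the splitting idea described in the abstract, the following is a minimal sketch for two of the listed families, not the authors' implementation. It assumes the standard constructions: binomial thinning for a Poisson observation, and for a Gaussian observation a split that requires the variance to be known. The function names `thin_poisson` and `thin_gaussian` and the fraction parameter `eps` are hypothetical, introduced only for this example.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Split Poisson counts x into two parts that sum to x.

    If X ~ Poisson(lam), then drawing X1 | X ~ Binomial(X, eps) and setting
    X2 = X - X1 gives independent X1 ~ Poisson(eps * lam) and
    X2 ~ Poisson((1 - eps) * lam).
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)
    return x1, x - x1


def thin_gaussian(x, eps, sigma2, rng):
    """Split Gaussian observations x (known variance sigma2) into two parts.

    If X ~ N(mu, sigma2), then X1 = eps * X + Z with
    Z ~ N(0, eps * (1 - eps) * sigma2) and X2 = X - X1 gives independent
    X1 ~ N(eps * mu, eps * sigma2) and X2 ~ N((1 - eps) * mu, (1 - eps) * sigma2).
    """
    x = np.asarray(x, dtype=float)
    noise = rng.normal(0.0, np.sqrt(eps * (1.0 - eps) * sigma2), size=x.shape)
    x1 = eps * x + noise
    return x1, x - x1


# Illustrative usage on simulated data (all values below are assumptions).
rng = np.random.default_rng(0)

counts = rng.poisson(lam=10.0, size=100_000)
c1, c2 = thin_poisson(counts, eps=0.3, rng=rng)

values = rng.normal(loc=2.0, scale=1.5, size=100_000)
g1, g2 = thin_gaussian(values, eps=0.5, sigma2=1.5**2, rng=rng)

# Empirical check: the two parts should be (nearly) uncorrelated.
poisson_corr = np.corrcoef(c1, c2)[0, 1]
gaussian_corr = np.corrcoef(g1, g2)[0, 1]
```

In a cross-validation setting, one part (for example `c1`) could be used to fit an unsupervised model such as a clustering, and the other part (`c2`) to evaluate it, since the two parts are independent but share the same underlying parameter up to the known scaling `eps`.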