We propose data thinning, a new approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to any observation drawn from a "convolution-closed" distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. It is similar in spirit to -- but distinct from, and more easily applicable than -- a recent proposal known as data fission. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the "usual" approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis.
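To make the idea concrete, the sketch below illustrates data thinning in the Poisson case, where the split takes a particularly simple form: given X ~ Poisson(lam), drawing X1 | X ~ Binomial(X, eps) and setting X2 = X - X1 yields independent X1 ~ Poisson(eps·lam) and X2 ~ Poisson((1 - eps)·lam). This is a minimal illustration of the general recipe, not the paper's full treatment of convolution-closed families; the function name and parameter choices (eps = 0.5, lam = 10) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_thin(x, eps, rng):
    """Split Poisson draws x into two independent parts.

    X1 | X ~ Binomial(X, eps) and X2 = X - X1, so that marginally
    X1 ~ Poisson(eps * lam), X2 ~ Poisson((1 - eps) * lam),
    with X1 independent of X2. (Illustrative helper, not from the paper.)
    """
    x1 = rng.binomial(x, eps)
    return x1, x - x1

lam, eps = 10.0, 0.5
x = rng.poisson(lam, size=100_000)
x1, x2 = poisson_thin(x, eps, rng)

# The two folds sum back to the original observations exactly,
# their means track eps*lam and (1-eps)*lam, and they are
# (empirically) uncorrelated -- the properties data thinning promises.
print(x1.mean(), x2.mean(), np.corrcoef(x1, x2)[0, 1])
```

One fold can then be used to fit an unsupervised method (e.g. k-means) and the other to evaluate it, mirroring the cross-validation use case described above.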