How many samples should one collect for an empirical distribution to be as close as possible to the true population? This question is not trivial in the context of single-cell RNA-sequencing. With limited sequencing depth, profiling more cells comes at the cost of fewer reads per cell. Therefore, one must strike a balance between the number of cells sampled and the accuracy of each measured gene expression profile. In this paper, we analyze an empirical distribution of cells and obtain upper and lower bounds on the Wasserstein distance to the true population. Our analysis holds for general, non-parametric distributions of cells, and is validated by simulation experiments on a real single-cell dataset.
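The trade-off described above can be illustrated with a toy simulation. The following is a minimal sketch under strong assumptions, not the paper's analysis or its bounds. It uses a single gene, a gamma-distributed "true" expression level, Poisson read counts, and a one-dimensional Wasserstein distance computed against a large reference sample. The function names `wasserstein_1d` and `simulate_tradeoff`, and every parameter value, are hypothetical choices made for illustration.

```python
import numpy as np


def wasserstein_1d(u, v):
    """Exact 1-Wasserstein distance between two 1-D empirical distributions."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    grid = np.sort(np.concatenate([u, v]))
    widths = np.diff(grid)
    # Empirical CDFs evaluated on each interval between consecutive grid points.
    u_cdf = np.searchsorted(u, grid[:-1], side="right") / u.size
    v_cdf = np.searchsorted(v, grid[:-1], side="right") / v.size
    # W1 in one dimension is the integral of |F_u - F_v|.
    return float(np.sum(np.abs(u_cdf - v_cdf) * widths))


def simulate_tradeoff(total_reads, cell_counts, n_reference=20000, seed=0):
    """Distance to a reference population for several cell counts, fixed budget."""
    rng = np.random.default_rng(seed)
    # Assumption: the true population is a gamma distribution over the
    # expression level of one gene, approximated by a large noiseless sample.
    reference = rng.gamma(shape=2.0, scale=1.0, size=n_reference)
    results = {}
    for n_cells in cell_counts:
        depth = total_reads / n_cells  # reads per cell under the fixed budget
        true_expr = rng.gamma(shape=2.0, scale=1.0, size=n_cells)
        # Assumption: counts are Poisson with mean depth * expression, then
        # rescaled by depth to estimate the expression level of each cell.
        observed = rng.poisson(depth * true_expr) / depth
        results[n_cells] = wasserstein_1d(observed, reference)
    return results


if __name__ == "__main__":
    for n_cells, dist in simulate_tradeoff(10_000, [10, 100, 1000, 10000]).items():
        print(f"cells={n_cells:>6}  W1~{dist:.3f}")
```

In this toy model, few cells give a high sampling error, while many cells leave each measurement with few reads and therefore a high measurement noise. The sketch only makes those two competing error sources visible. It says nothing about where the optimum lies for real data.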