While the cost of sequencing genomes has decreased dramatically in recent years, this expense often remains non-trivial. Under a fixed budget, then, scientists face a natural trade-off between quantity and quality; they can spend resources to sequence a greater number of genomes (quantity) or spend resources to sequence genomes with increased accuracy (quality). Our goal is to find the optimal allocation of resources between quantity and quality. Optimizing resource allocation promises to reveal as many new variations in the genome as possible, and thus as many new scientific insights as possible. In this paper, we consider the common setting where scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. We introduce a Bayesian nonparametric methodology to predict the number of new variants in the follow-up study based on the pilot study. When experimental conditions are kept constant between the pilot and follow-up, we demonstrate on real data from the gnomAD project that our prediction is more accurate than three recent proposals, and competitive with a more classic proposal. Unlike existing methods, though, our method allows practitioners to change experimental conditions between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used for (i) more realistic predictions and (ii) optimal allocation of a fixed budget between quality and quantity.