Linear functions of the site frequency spectrum (SFS) play a major role in understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or the average number of pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the best-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators and for determining significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the SFS. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory and numerically invert it to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. A software implementation of the phase-type methodology is available in the R package phasty, and R code for reproducing our results is available as an accompanying vignette.
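To make "linear function of the SFS" concrete, the following is a minimal sketch, not the phasty implementation. It assumes the standard textbook weights for Watterson's estimator (based on the number of segregating sites) and for the pairwise-difference estimator. The function names, the sample size and the SFS counts are hypothetical and chosen only for illustration.

```python
from math import comb

def watterson_weights(n):
    """Weights c_i = 1/a_n with a_n = sum_{i=1}^{n-1} 1/i (segregating-sites estimator)."""
    a_n = sum(1.0 / i for i in range(1, n))
    return [1.0 / a_n] * (n - 1)

def pairwise_weights(n):
    """Weights c_i = i(n-i) / C(n,2) (average number of pairwise differences)."""
    return [i * (n - i) / comb(n, 2) for i in range(1, n)]

def linear_sfs(weights, sfs):
    """Evaluate sum_i c_i * xi_i for an unfolded SFS xi_1, ..., xi_{n-1}."""
    if len(weights) != len(sfs):
        raise ValueError("weights and SFS must have the same length")
    return sum(w * x for w, x in zip(weights, sfs))

# Hypothetical data: n = 4 sequences, xi = (singletons, doubletons, tripletons).
n = 4
sfs = [3, 1, 1]

theta_w = linear_sfs(watterson_weights(n), sfs)
theta_pi = linear_sfs(pairwise_weights(n), sfs)

# The numerator of Tajima's D is itself a linear function of the SFS;
# the full statistic also divides by an estimated standard deviation.
d_numerator = theta_pi - theta_w

print(theta_w, theta_pi, d_numerator)
```

Both estimators share the same form and differ only in the weight vector, which is why a single distributional result for linear functions of the SFS covers them together.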