The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures, which can be extracted using non-negative matrix factorization (NMF). Extracting the signatures requires assuming a distribution for the observed mutational counts and fixing the number of signatures. In most applications the mutational counts are assumed to be Poisson distributed, but they are often overdispersed, which makes the Negative Binomial distribution more appropriate. We demonstrate in a simulation study that Negative Binomial NMF requires fewer signatures than Poisson NMF to fit the data, and we propose a Negative Binomial NMF with a patient-specific overdispersion parameter to capture the variation across patients. We also introduce a robust model selection procedure, inspired by cross-validation, to determine the number of signatures. Furthermore, we study how the distributional assumption affects two classical model selection procedures: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). In the presence of overdispersion, we show that our model selection procedure determines the correct number of signatures more robustly than state-of-the-art methods, which overestimate the number of signatures. We apply our proposed analysis to a wide range of simulated data and to a data set from breast cancer patients. The code for our algorithms and analysis is available in the R package SigMoS at https://github.com/MartaPelizzola/SigMoS.
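As a point of reference for the Poisson baseline mentioned above, the following is a minimal sketch of Poisson NMF fitted with the standard multiplicative updates for the generalized Kullback-Leibler divergence. It is not the SigMoS implementation and does not cover the Negative Binomial model. The count matrix `V`, the function names, and the settings (`k=2`, 200 iterations, the `EPS` guard) are assumptions chosen for illustration only.

```python
import math
import random

EPS = 1e-12  # assumed numerical guard against division by zero

# Toy count matrix, made up for illustration: 4 patients x 6 mutation types.
V = [
    [12, 3, 7, 1, 9, 4],
    [20, 5, 11, 2, 15, 6],
    [2, 14, 3, 10, 1, 8],
    [4, 25, 6, 18, 3, 15],
]


def matmul(a, b):
    """Matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def kl_divergence(v, wh):
    """Generalized KL divergence between counts v and their approximation wh.

    Up to a term that does not depend on the factors, this equals the
    negative Poisson log-likelihood.
    """
    total = 0.0
    for v_row, m_row in zip(v, wh):
        for x, m in zip(v_row, m_row):
            m = max(m, EPS)
            total += (x * math.log(x / m) if x > 0 else 0.0) - x + m
    return total


def poisson_nmf(v, k, n_iter=200, seed=0):
    """Factorize v (patients x types) as w (patients x k) times h (k x types)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.uniform(0.5, 1.5) for _ in range(k)] for _ in range(n)]
    h = [[rng.uniform(0.5, 1.5) for _ in range(m)] for _ in range(k)]
    history = [kl_divergence(v, matmul(w, h))]  # divergence at initialization
    for _ in range(n_iter):
        # Update exposures w with signatures h held fixed.
        wh = matmul(w, h)
        ratio = [[v[i][j] / (wh[i][j] + EPS) for j in range(m)] for i in range(n)]
        w = [[w[i][a] * sum(ratio[i][j] * h[a][j] for j in range(m)) / (sum(h[a]) + EPS)
              for a in range(k)] for i in range(n)]
        # Update signatures h with the new exposures w held fixed.
        wh = matmul(w, h)
        ratio = [[v[i][j] / (wh[i][j] + EPS) for j in range(m)] for i in range(n)]
        col_sums = [sum(w[i][a] for i in range(n)) for a in range(k)]
        h = [[h[a][j] * sum(w[i][a] * ratio[i][j] for i in range(n)) / (col_sums[a] + EPS)
              for j in range(m)] for a in range(k)]
        history.append(kl_divergence(v, matmul(w, h)))
    return w, h, history


exposures, signatures, history = poisson_nmf(V, k=2)
print("divergence at start:", round(history[0], 3))
print("divergence at end:  ", round(history[-1], 3))
```

Rerunning the fit for several values of `k` and comparing the resulting fits is the setting in which the choice of the number of signatures, and hence the model selection question raised above, arises.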