Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. The estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting. They provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and the parametrization is more stable than the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework on three data sets of somatic mutation counts from cancer patients.
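To make the NMF view of mutational signatures concrete, the following is a minimal sketch of an unconstrained Poisson NMF fitted with multiplicative updates for the generalized Kullback-Leibler divergence, which coincide with EM updates under a Poisson model. It is not the estimation procedure of this work: the regression step that restricts each signature to a mono- or di-nucleotide log-linear parametrization is omitted, so the sketch corresponds only to the general, parameter-rich case. The function names `poisson_nmf` and `gkl_divergence`, and the layout of the count matrix (patients in rows, mutation types in columns), are assumptions made for illustration.

```python
import numpy as np


def gkl_divergence(V, M, eps=1e-12):
    """Generalized Kullback-Leibler divergence between counts V and fit M."""
    return float(np.sum(V * np.log((V + eps) / (M + eps)) - V + M))


def poisson_nmf(V, k, n_iter=500, seed=0, eps=1e-12):
    """Approximate a non-negative count matrix V (patients x types) by W @ H.

    W (patients x k) holds the exposures and H (k x types) the signatures.
    The layout and names are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Update exposures given the current signatures.
        R = V / (W @ H + eps)
        W *= (R @ H.T) / (H.sum(axis=1) + eps)
        # Update signatures given the new exposures.
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
    # Rescale so each signature is a probability vector; W absorbs the scale,
    # which leaves the product W @ H unchanged.
    scale = H.sum(axis=1)
    return W * scale, H / scale[:, None]
```

Under these assumptions, `W, H = poisson_nmf(V, k=3)` returns exposures and signatures whose rows of `H` sum to one. A parametrized signature model would replace the free update of `H` with a fit of a log-linear model to the same expected counts.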