Mutational signatures are powerful summaries of the mutational processes altering the DNA of cancer cells and are increasingly relevant as biomarkers in personalized treatments. The standard approach to mutational signature analysis decomposes the matrix of mutation counts from a sample of patients via non-negative matrix factorization (NMF) algorithms. However, by working with aggregate counts, this procedure ignores the non-homogeneous patterns of occurrence of somatic mutations along the genome, as well as the tissue-specific characteristics that are known to influence their rate of appearance. This gap is primarily due to a lack of adequate methods for using locus-specific covariates directly in the factorization. In this paper, we address these limitations by introducing a model based on Poisson point processes to infer mutational signatures and their activities as they vary across genomic regions. Using covariate-dependent factorized intensity functions, our Poisson process factorization (PPF) generalizes the baseline NMF model to include regression coefficients that capture the effect of commonly known genomic features on the mutation rates from each latent process. Furthermore, our method relies on sparsity-inducing hierarchical priors to automatically infer the number of active latent factors in the data, avoiding the need to fit multiple models for a range of plausible ranks. We present algorithms to obtain maximum a posteriori estimates and to quantify uncertainty via Markov chain Monte Carlo. We test the method on simulated data and on real data from breast cancer, using covariates on chromosomal copy number alterations, histone modifications, cell replication timing, nucleosome positioning, and DNA methylation. Our results shed light, at high resolution, on the joint effect that epigenetic marks have on the latent processes.
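For orientation, the baseline that PPF generalizes can be sketched in a few lines. The block below is a minimal illustration of Poisson NMF on an aggregate count matrix, fitted with the classical multiplicative updates for the Kullback-Leibler objective. It is not the PPF model or the authors' algorithm: it uses no covariates, no sparsity-inducing priors, and a fixed rank. The function name `poisson_nmf`, its parameters, and the simulated data dimensions are assumptions made for illustration only.

```python
import numpy as np


def poisson_nmf(X, rank, n_iter=500, seed=0, eps=1e-12):
    """Baseline Poisson NMF sketch: X (channels x patients) ~ Poisson(R @ Theta).

    Uses multiplicative updates that do not increase the KL divergence.
    Hypothetical helper for illustration; not the PPF method.
    """
    rng = np.random.default_rng(seed)
    n_channels, n_patients = X.shape
    R = rng.random((n_channels, rank)) + eps      # signatures (columns)
    Theta = rng.random((rank, n_patients)) + eps  # activities (loadings)

    for _ in range(n_iter):
        # Update activities with signatures held fixed.
        fitted = R @ Theta + eps
        Theta *= (R.T @ (X / fitted)) / (R.sum(axis=0)[:, None] + eps)
        # Update signatures with activities held fixed.
        fitted = R @ Theta + eps
        R *= ((X / fitted) @ Theta.T) / (Theta.sum(axis=1)[None, :] + eps)

    # Rescale so each signature sums to one; the product R @ Theta is unchanged.
    scale = R.sum(axis=0)
    return R / scale, Theta * scale[:, None]


# Simulated example: 96 mutation channels, 30 patients, 3 latent processes.
rng = np.random.default_rng(1)
R_true = rng.dirichlet(np.ones(96), size=3).T          # 96 x 3
Theta_true = rng.gamma(2.0, 200.0, size=(3, 30))       # 3 x 30
X = rng.poisson(R_true @ Theta_true).astype(float)

R_hat, Theta_hat = poisson_nmf(X, rank=3)
```

Because this baseline sees only aggregate counts per patient, it cannot represent variation in signature activity along the genome, which is the limitation the covariate-dependent intensity functions in PPF are meant to address.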