Hypermethylation of promoter cytosine-guanine dinucleotide (CpG) islands has been shown to silence tumour suppressor genes, so identifying CpG sites that are differentially methylated between samples can assist in understanding disease. Differentially methylated CpG sites (DMCs) can be identified using moderated t-tests or nonparametric tests, but these approaches require data transformations because appropriate statistical methods that adequately account for the bounded nature of DNA methylation data are lacking. We propose a family of beta mixture models (BMMs) which use a model-based approach to cluster CpG sites given their original beta-valued methylation data, with no need for transformations. The BMMs allow (i) objective inference of methylation state thresholds and (ii) identification of DMCs between different sample types. The BMMs employ different parameter constraints, facilitating application to different study settings. Parameter estimation proceeds via an expectation-maximisation algorithm, with a novel approximation in the maximisation step providing tractability and computational feasibility. Performance of the BMMs is assessed through thorough simulation studies, and the BMMs are used to analyse a prostate cancer (PCa) dataset. The BMMs objectively infer intuitive and biologically interpretable methylation state thresholds, and identify DMCs that are related to genes implicated in carcinogenesis and involved in cancer-related pathways. An R package, betaclust, facilitates widespread use of BMMs.