Prostate cancer is one of the most common cancers in men. In its advanced stages, tumours can metastasise to other tissues in the body, which is often fatal. In this thesis, we perform a genetic analysis of prostate cancer tumours at different metastatic sites using data science, machine learning and topological network analysis methods. We present a general procedure for pre-processing gene expression datasets and pre-filtering significant genes by analytical methods. We then use machine learning models to further filter key genes and to classify tumours by secondary site. Finally, we perform gene co-expression network analysis and community detection on samples from different types of prostate cancer secondary site. We select 13 of the 14,379 genes as those most closely related to metastatic prostate cancer, achieving approximately 92% accuracy under cross-validation. In addition, we provide preliminary insights into the co-expression patterns of genes in these networks. Project code is available at https://github.com/zcablii/Master_cancer_project.