The rapid evolution of the SARS-CoV-2 virus, which causes COVID-19, has produced a number of mutations and variants such as Alpha, Gamma, Delta and Omicron, which have had a massive impact on the world economy. Unsupervised machine learning methods can compress, characterise and visualise unlabelled data. In this paper, we present a framework that combines selected dimensionality reduction and clustering methods to discriminate between the major SARS-CoV-2 variants and visualise the associations among them based on genome sequences. The framework uses k-mer analysis to process the genome (RNA) sequences and compares different dimensionality reduction methods, namely principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Furthermore, the framework employs agglomerative hierarchical clustering and visualises the result as a dendrogram. We find that the proposed framework can effectively distinguish the major variants and could therefore be used to distinguish emerging variants in the future.
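
As a rough illustration of the k-mer and agglomerative clustering steps described above, the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the function names (`kmer_profile`, `euclidean`, `agglomerate`), the choice of k = 3, the Euclidean distance, the average-linkage rule and the four toy sequences are all assumptions made for illustration, and the dimensionality reduction step (PCA, t-SNE or UMAP) and the dendrogram plot are omitted.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3, alphabet="ACGT"):
    """Return normalised k-mer frequencies in a fixed lexicographic order."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA notation alike
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers over the alphabet are counted, so windows containing
    # ambiguous bases such as N are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(profiles):
    """Average-linkage agglomerative clustering.

    Returns a list of merges (cluster_a, cluster_b, distance). Leaves are
    numbered 0..n-1 and each merge creates the next unused cluster id.
    """
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(euclidean(profiles[i], profiles[j])
                        for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Toy sequences (hypothetical, NOT real SARS-CoV-2 genomes): two pairs that
# differ from each other by a single trailing base.
sequences = [
    "ATGATGATGATGATGATG",
    "ATGATGATGATGATGATC",
    "CCGGCCGGCCGGCCGGCC",
    "CCGGCCGGCCGGCCGGCA",
]
profiles = [kmer_profile(s, k=3) for s in sequences]
merges = agglomerate(profiles)

for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.3f}")
```

The merge list carries the same information a dendrogram displays: which clusters join and at what distance. For real genomes, one would more likely rely on established libraries for the reduction and plotting steps rather than this hand-rolled loop, which scales poorly with the number of sequences.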