Due to high mutation rates, COVID-19 evolved rapidly, and several variants, such as Alpha, Gamma, Delta, Beta, and Omicron, emerged with altered viral properties, including disease severity and transmission rate. These variants burdened medical systems worldwide and had a massive impact on the world economy, as each had to be studied and dealt with in its own way. Unsupervised machine learning methods can compress, characterize, and visualize unlabelled data. In this paper, we present a framework that uses unsupervised machine learning methods to discriminate between major COVID-19 variants and to visualize the associations among them based on their genome sequences. These methods combine selected dimensionality reduction and clustering techniques. The framework processes the RNA sequences by performing a k-mer analysis on the data and then compares the results of different dimensionality reduction methods: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbour Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). Our framework also employs agglomerative hierarchical clustering to visualize, using dendrograms, the mutational differences among major variants of concern and the country-wise mutational differences within particular variants (Delta and Omicron). We conclude that the proposed framework can effectively distinguish between the major variants and hence can be used to identify emerging variants in the future.
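To make the k-mer analysis and agglomerative hierarchical clustering steps concrete, the following is a minimal sketch and not the authors' implementation. It uses only the Python standard library and omits the dimensionality reduction stage (PCA, t-SNE, UMAP). The choice of k = 3, Euclidean distance, and average linkage, the function names, and the toy sequences are all assumptions made for illustration; the sequences are invented and are not real genomes.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=3):
    """Return normalised counts of every k-mer over the alphabet ACGT."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA letters alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[m] / total for m in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_merges(vectors, labels):
    """Average-linkage agglomerative clustering.

    Returns the merges in order as (left_labels, right_labels, distance),
    which is the information a dendrogram is drawn from.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_sum = sum(
                    euclidean(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                d = pair_sum / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((
            sorted(labels[m] for m in clusters[i]),
            sorted(labels[m] for m in clusters[j]),
            d,
        ))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


# Hypothetical toy sequences, invented for illustration only.
toy = {
    "seq_a": "ACGTACGTACGTACGT",
    "seq_b": "ACGTACGTACGAACGT",
    "seq_c": "GGGGCCCCGGGGCCCC",
}
names = list(toy)
vectors = [kmer_frequencies(toy[n], k=3) for n in names]
for left, right, dist in agglomerative_merges(vectors, names):
    print(left, "+", right, "at distance", round(dist, 3))
```

In this toy input, the two sequences that differ by a single base are merged first, and the unrelated sequence joins last. On real genome data, a library implementation of hierarchical clustering would normally replace the quadratic loop above, and the k-mer vectors could first be passed through a dimensionality reduction method for visualization.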