The COVID-19 pandemic has led to a worldwide effort to characterize its evolution by mapping mutations in the genome of the coronavirus SARS-CoV-2. Ideally, one would like to leverage the large number of available genomes to quickly identify new mutations that could confer adaptive advantages (e.g., higher infectivity or immune evasion). One way of identifying adaptive mutations is to look for convergent mutations, that is, mutations at the same genomic position that arise independently. However, the large number of currently available genomes precludes the efficient use of phylogeny-based techniques. Here, we establish a fast and scalable Topological Data Analysis approach, based on persistent homology, for the early warning and surveillance of emerging adaptive mutations. It identifies convergent events solely by their topological footprint and thus overcomes limitations of current phylogenetic inference techniques. This allows for an unbiased and rapid analysis of large viral datasets. We introduce a new topological measure for convergent evolution and apply it to the GISAID dataset as of February 2021, comprising 303,651 high-quality SARS-CoV-2 isolates collected since the beginning of the pandemic. We find that topologically salient mutations on the receptor-binding domain appear in several variants of concern and are linked with an increase in infectivity and immune escape. For many adaptive mutations, the topological signal precedes an increase in prevalence, which shows that our method effectively identifies emerging adaptive mutations at an early stage. By localizing topological signals in the dataset, we extract geo-temporal information about the early occurrence of emerging adaptive mutations. The identification of these mutations can help to develop an alert system for monitoring mutations of concern and can guide experimentalists toward the study of specific circulating variants.
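To make the notion of a "topological footprint" concrete, the following is a minimal sketch on toy data; it is not the authors' pipeline or their topological measure. It assumes four hypothetical genomes in which two point mutations occur in every combination, builds a Vietoris-Rips complex on Hamming distances, and counts one-dimensional holes (the first Betti number) over GF(2). All function names and sequences are illustrative assumptions.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    pivots = {}  # highest set bit -> row with that leading bit
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]  # cancel the leading bit and keep reducing
    return len(pivots)

def betti_1(genomes, scale):
    """First Betti number of the Vietoris-Rips complex at a given scale."""
    n = len(genomes)

    def close(i, j):
        return hamming(genomes[i], genomes[j]) <= scale

    edges = [e for e in combinations(range(n), 2) if close(*e)]
    triangles = [t for t in combinations(range(n), 3)
                 if all(close(i, j) for i, j in combinations(t, 2))]
    edge_index = {e: k for k, e in enumerate(edges)}
    # Boundary of an edge: its two endpoints.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    # Boundary of a triangle: its three edges.
    d2 = [sum(1 << edge_index[e] for e in combinations(t, 2))
          for t in triangles]
    # b1 = dim ker(d1) - rank(d2)
    return len(edges) - gf2_rank(d1) - gf2_rank(d2)

# Hypothetical toy data: an ancestor, two single mutants, and a double mutant.
genomes = ["ACGT", "TCGT", "ACAT", "TCAT"]

print(betti_1(genomes, 1))  # 1: the four genomes form an unfilled cycle
print(betti_1(genomes, 2))  # 0: the cycle is filled in at the larger scale
```

In this toy setting the cycle appears at scale 1 and disappears at scale 2. Without recombination, four such genomes cannot be placed on a tree unless one of the two mutations arose more than once, which is why a cycle of this kind can be read as a hint of convergence.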