The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than for any other virus. This volume will continue to grow geometrically for SARS-CoV-2 and for other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or to regions of interest (e.g., spike). In this work, we propose \emph{ViralVectors}, a method for generating compact feature vectors from virome sequencing data that allows effective downstream analysis. The generation is based on \emph{minimizers}, a type of lightweight "signature" of a sequence, used traditionally in assembly and read mapping. To our knowledge, this is the first use of minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks.
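To make the idea of a minimizer-based feature vector concrete, the following is a minimal Python sketch. It is not the authors' implementation. It assumes one common definition of a minimizer (the lexicographically smallest m-mer found in a k-mer or in the reversed k-mer) and assumes that the feature vector is a simple count of minimizers over all k-mers of a sequence. The function names `minimizer` and `minimizer_vector`, the default values of `k` and `m`, and the nucleotide alphabet are all illustrative choices.

```python
from collections import Counter
from itertools import product
from typing import List


def minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest m-mer of `kmer` or of its reversal.

    Assumption: the reversed string (not the reverse complement) is used.
    """
    candidates = []
    for s in (kmer, kmer[::-1]):
        candidates.extend(s[i:i + m] for i in range(len(s) - m + 1))
    return min(candidates)


def minimizer_vector(seq: str, k: int = 9, m: int = 3,
                     alphabet: str = "ACGT") -> List[int]:
    """Count the minimizers of all k-mers of `seq` into a fixed-length vector.

    The vector has len(alphabet) ** m entries, one per possible m-mer in
    lexicographic order. Minimizers containing characters outside `alphabet`
    (e.g. an ambiguous base 'N') are silently ignored in this sketch.
    """
    counts = Counter(
        minimizer(seq[i:i + k], m) for i in range(len(seq) - k + 1)
    )
    index = ["".join(p) for p in product(alphabet, repeat=m)]
    return [counts.get(mer, 0) for mer in index]


if __name__ == "__main__":
    vec = minimizer_vector("ACGTACGTAC", k=5, m=2)
    print(len(vec), sum(vec))
```

Because the vector length depends only on the alphabet size and `m`, sequences of different lengths, and in principle individual unassembled reads, map to vectors of the same dimension, which is what makes them usable as input to standard classifiers and clustering algorithms.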