Metagenomics is an emerging field of molecular biology concerned with analyzing the genomes of environmental samples that comprise many diverse organisms. Because the organisms in such a sample cannot readily be separated, one usually has to sequence the genomic material of all organisms in a single batch, which yields a mix of reads originating from different DNA sequences. In deep high-throughput sequencing experiments, the volume of raw reads is extremely high, frequently exceeding 600 Gb. With an ever-increasing demand for storing such reads for future studies, efficient metagenomic compression becomes of paramount importance. We present the first known approach to metagenome read compression, termed MCUIUC (Metagenomic Compression at UIUC). The gist of the proposed algorithm is to classify reads based on unique organism identifiers, then perform reference-based alignment of the reads for each individually identified organism, and finally perform metagenomic assembly of the unclassified reads. Once assembly and classification are completed, lossless reference-based compression is performed via positional encoding. We evaluate the performance of the algorithm on moderate-sized synthetic metagenomic samples involving 15 randomly selected organisms and describe future directions for improving the proposed compression method.
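As an illustration of the final step only, the following minimal sketch shows one way positional encoding against a reference can be lossless. It is not the MCUIUC implementation: the function names `encode_read` and `decode_read` are hypothetical, the alignment position is assumed to be already known, and reads are assumed to differ from the reference by substitutions only (no insertions or deletions).

```python
from typing import List, Tuple

# (start position in reference, read length, [(offset in read, read base), ...])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, pos: int) -> Encoded:
    """Encode a read as its alignment position plus substitution mismatches.

    Assumption: the read aligns to the reference at `pos` without indels.
    """
    if pos < 0 or pos + len(read) > len(reference):
        raise ValueError("read does not fit in the reference at this position")
    mismatches = [
        (offset, base)
        for offset, base in enumerate(read)
        if reference[pos + offset] != base
    ]
    return pos, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the original read exactly from the reference and its encoding."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base  # restore the base that differed from the reference
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    read = "GTTCG"  # aligns at position 2 with one substitution
    encoded = encode_read(reference, read, 2)
    print(encoded)
    print(decode_read(reference, encoded) == read)
```

A read that matches the reference closely is reduced to a position, a length, and a short mismatch list, which is where the savings come from; a practical scheme would additionally entropy-code these fields and handle indels.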