Motivation: Despite significant advances in Third-Generation Sequencing (TGS) technologies, Next-Generation Sequencing (NGS) technologies remain dominant in the current sequencing market, owing to their lower error rates and richer analytical software compared with TGS. NGS technologies generate vast amounts of genomic data, including short reads, quality values, and read identifiers. Efficient compression of such data has therefore become a pressing need and has prompted extensive research on FASTQ compressors. Previous research suggests that lossless compression of quality values is approaching its limit, whereas considerable room remains for improving the compression of the reads.

Results: By investigating the characteristics of the sequencing process, we present AMGC, a new algorithm for compressing the reads in FASTQ files, which can be integrated into various genomic compression tools. We first reviewed the pipeline of reference-based algorithms and identified three key components that heavily impact storage: the positions at which reads match the reference sequence (refpos), the positions of mismatched bases within reads (mispos), and the reads that fail to match (unmapseq). To reduce their sizes, we conducted a detailed analysis of the distribution of matching positions and sequencing errors and then developed the three modules of AMGC accordingly. In our experiments, AMGC outperformed the current state-of-the-art methods, achieving an average gain of 81.23% in compression ratio over the second-best-performing compressor.
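To make the three components concrete, the following minimal Python sketch splits a set of reads into refpos, mispos, and unmapseq streams. It is an illustration only and not the AMGC implementation: the function names `best_alignment` and `split_streams`, the `max_mismatches` threshold, and the brute-force gap-free search are all assumptions made for this example. A real reference-based compressor would use an indexed aligner and would entropy-code each stream.

```python
from typing import List, Optional, Tuple


def best_alignment(read: str, ref: str, max_mismatches: int) -> Optional[Tuple[int, List[int]]]:
    """Hypothetical helper: brute-force, gap-free search for the reference
    offset with the fewest mismatches (earliest offset wins ties)."""
    best: Optional[Tuple[int, List[int]]] = None
    for pos in range(len(ref) - len(read) + 1):
        # Offsets within the read where the base differs from the reference.
        mismatches = [i for i, base in enumerate(read) if base != ref[pos + i]]
        if len(mismatches) > max_mismatches:
            continue
        if best is None or len(mismatches) < len(best[1]):
            best = (pos, mismatches)
    return best


def split_streams(
    reads: List[str], ref: str, max_mismatches: int = 2
) -> Tuple[List[int], List[List[int]], List[str]]:
    """Separate reads into the three streams named in the text."""
    refpos: List[int] = []          # matching position of each mapped read
    mispos: List[List[int]] = []    # mismatch offsets inside each mapped read
    unmapseq: List[str] = []        # reads that could not be matched
    for read in reads:
        hit = best_alignment(read, ref, max_mismatches)
        if hit is None:
            unmapseq.append(read)
        else:
            pos, mismatches = hit
            refpos.append(pos)
            mispos.append(mismatches)
    return refpos, mispos, unmapseq


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    streams = split_streams(["GTAC", "TTAGG", "CCCCC"], reference)
    print(streams)
```

In this toy setting, a mapped read costs only one integer plus a short list of offsets (a full decoder would also need the substituted bases, omitted here), while an unmapped read must be stored in full. This is why the sizes of these three streams largely determine the compressed size of the reads.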