The FASTQ format is a standard format for storing the output of high-throughput sequencing experiments. It comprises three main components: (i) headers, (ii) bases (nucleotide sequences), and (iii) quality scores. FASTQ files are widely used for variant calling, where sequencing data are mapped to a reference genome to discover variants that may be used for further analysis. Many specialized compressors exploit redundancy in FASTQ data, but they focus on either the bases or the quality scores alone. In this paper we consider the novel problem of lossily compressing FASTQ data, in a reference-free way, by modifying both components at the same time while preserving the important information of the original FASTQ. We introduce a general strategy, based on the Extended Burrows-Wheeler Transform (EBWT) and positional clustering, and we present implementations in both internal memory and external memory. Experimental results show that the lossy compression performed by our tool achieves good compression while preserving information relevant to variant calling better than competing tools. Availability: the software is freely available at https://github.com/veronicaguerrini/BFQzip.
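
To make the three FASTQ components concrete, the following minimal Python sketch splits records into header, bases, and quality scores. It is only an illustration and is not taken from BFQzip: the names `FastqRecord`, `parse_fastq`, and `phred_scores` are hypothetical, and the code assumes the common four-line record layout (no sequences wrapped across lines) and Phred+33 quality encoding.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type for illustration)."""
    header: str   # text after the leading '@'
    bases: str    # nucleotide sequence
    quality: str  # one ASCII-encoded quality symbol per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records, assuming exactly four lines per record."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        bases = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if bases is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(bases) != len(quality):
            raise ValueError("bases and quality lengths differ")
        yield FastqRecord(header[1:], bases, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integers (Phred+33 is an assumption)."""
    return [ord(symbol) - offset for symbol in quality]


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "II!I\n"]
    for record in parse_fastq(example):
        print(record.header, record.bases, phred_scores(record.quality))
```

A lossy compressor in the sense described above would alter the `bases` and `quality` fields of such records; the sketch covers only the parsing step, not the EBWT-based strategy.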