We propose a new compression scheme for genomic data given as sequence fragments called reads. The scheme uses a reference genome at the decoder side only, freeing the encoder from the burdens of storing references and performing computationally costly alignment operations. The main ingredient of the scheme is a multi-layer code construction, delivering to the decoder sufficient information to align the reads, correct their differences from the reference, validate their reconstruction, and correct reconstruction errors. The core of the method is the well-known concept of distributed source coding with decoder side information, fortified by a generalized-concatenation code construction enabling efficient embedding of all the information needed for reliable reconstruction. We first present the scheme for the case of substitution errors only between the reads and the reference, and then extend it to support reads with a single deletion and multiple substitutions. A central tool in this extension is a new distance metric that is shown analytically to improve alignment performance over existing distance metrics.
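To make the idea of distributed source coding with decoder side information concrete, the following is a minimal toy sketch, not the multi-layer construction proposed here. It assumes binary symbols, a read block already aligned to a reference block of the same length, and at most one substitution between them. The function names `syndrome`, `encode`, and `decode` are hypothetical, and the Hamming(7,4) code is chosen only for illustration. The encoder sends only a 3-bit syndrome of each 7-bit block and never sees the reference; the decoder combines that syndrome with its reference block to recover the read.

```python
# Toy syndrome-based coding with decoder side information.
# Assumptions (illustrative only): binary alphabet, known alignment,
# and at most one substitution per 7-bit block.

def syndrome(bits):
    """Hamming(7,4) syndrome of a 7-bit block, as an integer in 0..7.

    Column j (1-based) of the parity-check matrix is the binary
    representation of j, so a single flipped bit at position j
    has syndrome j.
    """
    s = 0
    for pos, b in enumerate(bits, start=1):
        if b:
            s ^= pos
    return s


def encode(read_block):
    """Encoder: uses no reference, sends only the 3-bit syndrome."""
    return syndrome(read_block)


def decode(read_syndrome, reference_block):
    """Decoder: recover the read from its syndrome and the reference."""
    # The syndrome is linear, so this XOR is the syndrome of the
    # difference pattern between the read and the reference.
    diff = read_syndrome ^ syndrome(reference_block)
    recovered = list(reference_block)
    if diff != 0:
        recovered[diff - 1] ^= 1  # undo the single substitution
    return recovered


reference = [1, 0, 1, 1, 0, 0, 1]
read = [1, 0, 1, 1, 1, 0, 1]  # differs from the reference at position 5
recovered = decode(encode(read), reference)
print(recovered == read)
```

In this toy setting the encoder transmits 3 bits per 7-bit block instead of 7. Handling more substitutions, unknown alignment, or deletions requires stronger codes and the additional layers described above.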