Motivation: A Genomic Dictionary, i.e., the set of k-mers appearing in a genome, is a fundamental source of genomic information: collecting it is the first step in strategic computational methods ranging from assembly to sequence comparison and phylogeny. Unfortunately, it is costly to store, which has motivated recent studies on the compression of k-mer sets. However, this area lacks the maturity of genomic compression: it has no homogeneous and methodologically sound experimental foundation that allows a fair comparison of the relative merits of the available solutions and that also takes into account the rich choice of compression methods that can be used.

Results: We provide such a foundation here, supporting it with an extensive set of experiments that use reference datasets and a carefully selected set of representative data compressors. Our results highlight the spectrum of compressor choices available in terms of Pareto Optimality of compression vs. post-processing, the latter being important when the Dictionary needs to be decompressed many times. In addition to useful indications, not available elsewhere, for researchers interested in storing k-mer dictionaries in compressed form, we provide a software system that can be readily used to explore the Pareto Optimal solutions available for a given Dictionary.

Availability: The software system is available at https://github.com/GenGrim76/Pareto-Optimal-GDC, together with user manuals and installation instructions.

Contact: raffaele.giancarlo@unipa.it

Supplementary information: Additional data are available in the Supplementary Material.
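As an illustration of the two notions the abstract relies on, the following minimal Python sketch builds a k-mer dictionary from a sequence and selects the Pareto-optimal entries among candidates scored on two costs to be minimized (for instance, compressed size and post-processing time). It is not taken from the Pareto-Optimal-GDC software: the function names `kmer_dictionary` and `pareto_front`, the toy sequence, and the example scores are all hypothetical and serve only to make the definitions concrete.

```python
def kmer_dictionary(genome: str, k: int) -> set[str]:
    """Return the set of distinct k-mers (length-k substrings) of `genome`."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def pareto_front(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of the points not dominated by any other point.

    Each point is (size, time) and both coordinates are minimized.
    A point is dominated if another point is no worse on both
    coordinates and strictly better on at least one.
    """
    front = []
    for name, (size, time) in points.items():
        dominated = any(
            (s <= size and t <= time) and (s < size or t < time)
            for other, (s, t) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


if __name__ == "__main__":
    # Toy sequence, not a real genome.
    print(sorted(kmer_dictionary("ACGTACGT", 3)))

    # Made-up (compressed size, post-processing time) pairs,
    # purely illustrative and not measurements from the study.
    candidates = {
        "fast": (100.0, 1.0),
        "mid": (80.0, 2.0),
        "small": (60.0, 4.0),
        "bad": (110.0, 3.0),
    }
    print(pareto_front(candidates))
```

In this sketch a candidate is kept only if no other candidate is at least as good on both costs and strictly better on one, so "bad" is discarded because "fast" is both smaller and quicker, while the other three represent different trade-offs.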