In computational biology, $k$-mers and edit distance are fundamental concepts. However, little is known about the metric space of all $k$-mers equipped with the edit distance. In this work, we explore the structure of the $k$-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all $k$-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly those with high error-rates. Finding an MIS is a challenging problem, as the size of a $k$-mer space grows geometrically with respect to $k$. We propose three algorithms for this problem. The first and the most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality-property of the $k$-mer space and the estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented and the calculated MISs of $k$-mer spaces and their statistical properties are reported and analyzed for $k$ up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace .
翻译:关于编辑距离下k-mer的最大独立集
翻译后的摘要:
在计算生物学中,k-mer和编辑距离是基本概念。 然而,对所有k-mer的度量空间配备编辑距离的结构知之甚少。 在这项工作中,我们通过研究其最大独立集(MIS)来探索k-mer空间的结构。 MIS是所有k-mer的稀疏草图,具有良好的理论特性,因此在聚类、索引、哈希和草绘大规模测序数据(特别是高误差率数据)方面具有关键应用。 找到一个MIS是一个具有挑战性的问题,因为k-mer空间的大小随k呈几何增长。 我们提出了三种算法来解决这个问题。 第一个和最直观的方法使用贪婪策略。 第二种方法实现了两种技术,通过利用k-mer空间的本地性质和对编辑距离的估计边界来避免冗余比较。 最后一个算法通过将编辑距离转换为特别设计的图中的最短路径来避免昂贵的编辑距离计算。 这些算法已经实现,并且计算出的k-mer空间的MIS及其统计特性已经报告和分析,对于k到15。 源代码可以免费在https://github.com/Shao-Group/kmerspace上获得。