切制和序列对齐:比率扭曲观点 (Sketching and Sequence Alignment: A Rate-Distortion Perspective)

Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length $n$ and introduce a new sketching algorithm called "locational hashing." While standard approaches in the literature based on min-hashes require $B = (1/D) \cdot O\left( \log n \right)$ bits to achieve a distortion $D$, our proposed approach only requires $B = \log^2(1/D) \cdot O(1)$ bits. This can lead to significant computational savings in pairwise alignment estimation.

翻译：在生物信息学中,DNA测序数据的对称对齐是一个普遍存在的任务,通常是一种沉重的计算负担。加速这项任务的标准方法是计算DNA的“切片”读数(通常通过散射法计算),以便有效地计算对齐排序分数。我们提出一个率扭曲框架,研究计算草图的问题,从而在草图大小和对齐估计扭曲之间实现最佳的权衡。我们考虑的是简单设置i.d.无误长度源$n,并引入一个新的绘图算法,称为“定位散射 ” 。而基于 min-hashes 的文献中标准方法需要$B = (1/D)\ cddd\ cdot O\left (\log n\right) 位元才能实现扭曲 $D,我们拟议的方法只需要$B =\log2/1/D)\cd,\cdot O(1) 位元。这可以导致对齐校准估计的计算节约。