Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment

Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. State-of-the-art approaches to speed up this task use hashing to identify short segments (k-mers) that are shared by pairs of reads, which can then be used to estimate alignment scores. However, when the number of reads is large, accurately estimating alignment scores for all pairs is still very costly. Moreover, in practice, one is only interested in identifying pairs of reads with large alignment scores. In this work, we propose a new approach to pairwise alignment estimation based on two key new ingredients. The first ingredient is to cast the problem of pairwise alignment estimation under a general framework of rank-one crowdsourcing models, where the workers' responses correspond to k-mer hash collisions. These models can be accurately solved via a spectral decomposition of the response matrix. The second ingredient is to utilise a multi-armed bandit algorithm to adaptively refine this spectral estimator only for read pairs that are likely to have large alignments. The resulting algorithm iteratively performs a spectral decomposition of the response matrix for adaptively chosen subsets of the read pairs.
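The two ingredients can be illustrated with a toy simulation. The sketch below is not the authors' algorithm. It assumes a simplified rank-one model in which read pair i and hash function j collide with probability p[i] * q[j], estimates the per-pair scores from the leading left singular vector of the response matrix (computed by power iteration), and uses a simple halving-style elimination as a stand-in for the multi-armed bandit step. The names `top_left_singular_vector`, `adaptive_top_pairs`, `p`, `q`, and `batch` are hypothetical and chosen for illustration only.

```python
import random


def top_left_singular_vector(Y, iters=50):
    """Leading left singular vector of Y via power iteration on Y Y^T.

    Y is a list of rows with non-negative entries, so starting from an
    all-ones vector keeps every iterate non-negative.
    """
    n, m = len(Y), len(Y[0])
    u = [1.0] * n
    for _ in range(iters):
        # v = Y^T u, then w = Y v
        v = [sum(Y[i][j] * u[i] for i in range(n)) for j in range(m)]
        w = [sum(Y[i][j] * v[j] for j in range(m)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return u  # all-zero responses carry no information
        u = [x / norm for x in w]
    return u


def adaptive_top_pairs(p, k, batch, rng):
    """Return the k pairs with the largest estimated scores.

    Assumed toy model: response[i][j] ~ Bernoulli(p[i] * q[j]), so the
    expected response matrix is the rank-one matrix p q^T.
    Each round draws `batch` new hash responses for the surviving pairs
    only, re-runs the spectral estimate, and keeps the better half.
    """
    active = list(range(len(p)))
    responses = {i: [] for i in active}
    samples_used = 0
    while len(active) > k:
        # Hypothetical per-hash collision parameters for this round.
        q = [rng.uniform(0.5, 1.0) for _ in range(batch)]
        for i in active:
            responses[i].extend(
                1.0 if rng.random() < p[i] * qj else 0.0 for qj in q
            )
        samples_used += len(active) * batch
        # Spectral estimate restricted to the surviving pairs.
        Y = [responses[i] for i in active]
        u = top_left_singular_vector(Y)
        ranked = sorted(range(len(active)), key=lambda r: u[r], reverse=True)
        keep = max(k, len(active) // 2)
        active = [active[r] for r in ranked[:keep]]
    return sorted(active), samples_used
```

For example, calling `adaptive_top_pairs(p, k=4, batch=40, rng=random.Random(0))` with 32 pairs spends 40 hash evaluations per surviving pair in each of three rounds (32, then 16, then 8 pairs), whereas evaluating the same 120 hashes for all 32 pairs would cost 3840 evaluations. The saving comes from discarding low-scoring pairs early, which is the intuition behind refining the estimator only where large alignments are likely.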
