As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable greater sensitivity in read mapping and in downstream analyses such as variant calling. More sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. Several tools attempt to accelerate the process of updating a read set from one reference to another (i.e., remapping). However, if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, these tools cannot remap the read. We find that, as a result of this drawback, a significant portion of annotations is lost when using state-of-the-art remapping tools. To address this major limitation of existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4x. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants.
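The limitation of coordinate-based remapping described above can be illustrated with a minimal sketch. This is not AirLift's algorithm and not the API of any existing tool. It assumes a hypothetical list of gap-free aligned blocks between the old and new reference (the names `Block`, `find_block`, and `lift_read` are invented for this illustration), and it shows that a read is dropped whenever it does not lie entirely inside one such block.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """A gap-free region shared by both references (hypothetical format)."""
    old_start: int  # 0-based start of the block in the old reference
    new_start: int  # 0-based start of the same block in the new reference
    length: int     # number of bases in the block


def find_block(blocks: List[Block], pos: int) -> Optional[Block]:
    """Return the block containing old-reference position `pos`, if any.

    Assumes `blocks` is sorted by `old_start` and non-overlapping.
    """
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    block = blocks[i]
    if pos < block.old_start + block.length:
        return block
    return None


def lift_read(blocks: List[Block], start: int, read_len: int) -> Optional[int]:
    """Translate a read's start coordinate from the old to the new reference.

    Returns None when the read starts outside every block or extends past
    the end of its block, i.e. when the region has no counterpart in the
    new reference under this simplified model.
    """
    block = find_block(blocks, start)
    if block is None or start + read_len > block.old_start + block.length:
        return None
    return block.new_start + (start - block.old_start)


# Toy example: old positions 100-149 have no counterpart in the new reference.
blocks = [Block(0, 0, 100), Block(150, 120, 200)]
reads = [(10, 50), (160, 50), (90, 50), (110, 20)]  # (old start, read length)
lifted = [lift_read(blocks, s, n) for s, n in reads]
lost = sum(1 for x in lifted if x is None)
print(f"lifted={lifted}, lost={lost} of {len(reads)} reads")
```

In this toy model the last two reads are lost, and recovering them would require aligning them against the new reference, which is the gap that purely coordinate-based remapping leaves open.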