AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable more sensitive read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. Several tools attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping). However, if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations is lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4x. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants.
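
To make the limitation described above concrete, the following is a minimal sketch of coordinate-based remapping, assuming a hypothetical table of gap-free blocks shared by the old and new references. The names `Block`, `BLOCKS`, `lift_position`, and `lift_read` and all coordinates are invented for illustration. This is not AirLift's algorithm, which the abstract does not describe; it only shows why a read that falls in an old-reference region with no counterpart in the new reference cannot be carried over by coordinate translation alone.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """A gap-free stretch present in both references (hypothetical data model)."""
    old_start: int  # 0-based, inclusive, in the old reference
    old_end: int    # exclusive, in the old reference
    new_start: int  # where the same stretch begins in the new reference


# Illustrative blocks, sorted by old_start. Old positions 250..399 have
# no counterpart in the new reference.
BLOCKS: List[Block] = [
    Block(0, 100, 0),
    Block(100, 250, 130),
    Block(400, 600, 280),
]


def lift_position(blocks: List[Block], pos: int) -> Optional[int]:
    """Translate one old-reference position, or return None if it is unmapped."""
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    if pos >= block.old_end:
        return None  # pos falls in a gap between blocks
    return block.new_start + (pos - block.old_start)


def lift_read(blocks: List[Block], start: int, length: int) -> Optional[int]:
    """Translate a read's start only if the whole read stays contiguous."""
    first = lift_position(blocks, start)
    last = lift_position(blocks, start + length - 1)
    if first is None or last is None or last - first != length - 1:
        return None  # such a read would need full mapping instead
    return first


if __name__ == "__main__":
    print(lift_read(BLOCKS, 120, 50))  # read inside a shared block
    print(lift_read(BLOCKS, 300, 50))  # read inside a region missing from the new reference
```

In this sketch, a read that lies in the missing region, or that straddles two blocks that are no longer adjacent, yields `None`. Reads in that situation are the ones the abstract says cannot be remapped by existing tools.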

