Basecalling is an essential step in nanopore sequencing analysis in which the raw signals from nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and are thus discarded in later steps of the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first fast and widely applicable pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. TargetCall filters out all off-target reads before basecalling, so the highly accurate but slow basecalling is performed only on the raw signals whose noisy reads are labeled as on-target. Our thorough experimental evaluations using both real and simulated data show that TargetCall (1) improves the end-to-end basecalling performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) sensitivity in keeping on-target reads, (2) maintains high accuracy in downstream analysis, (3) precisely filters out up to 94.71% of off-target reads, and (4) achieves better performance, sensitivity, and generality than prior works. We freely open-source TargetCall at https://github.com/CMU-SAFARI/TargetCall.
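The two-stage filter described in the abstract can be illustrated with a minimal Python sketch. It is not TargetCall's implementation: the function names (`target_call`, `is_on_target`, `kmers`), the parameters `k` and `min_frac`, and the k-mer overlap test are all assumptions made for illustration. In particular, the k-mer test is only a simple stand-in for Similarity Check, and the two basecallers are passed in as plain callables rather than neural networks.

```python
from typing import Callable, Iterable, List, Set, TypeVar

Signal = TypeVar("Signal")  # whatever type represents one raw nanopore signal


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_on_target(noisy_read: str, ref_kmers: Set[str], k: int, min_frac: float) -> bool:
    """Hypothetical stand-in for Similarity Check.

    Labels a noisy read as on-target when at least min_frac of its
    k-mers also occur in the target reference.
    """
    read_kmers = kmers(noisy_read, k)
    if not read_kmers:  # read shorter than k: nothing to compare
        return False
    shared = len(read_kmers & ref_kmers)
    return shared / len(read_kmers) >= min_frac


def target_call(
    signals: Iterable[Signal],
    light_call: Callable[[Signal], str],      # fast, noisy basecaller (LightCall's role)
    accurate_call: Callable[[Signal], str],   # slow, accurate basecaller
    reference: str,
    k: int = 5,            # assumed value, chosen only for the toy example
    min_frac: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[str]:
    """Run the accurate basecaller only on signals whose noisy read is on-target."""
    ref_kmers = kmers(reference, k)
    reads = []
    for signal in signals:
        noisy = light_call(signal)
        if is_on_target(noisy, ref_kmers, k, min_frac):
            reads.append(accurate_call(signal))
        # Off-target signals are dropped here, before the expensive step.
    return reads
```

The point of the structure is that `accurate_call` is invoked only for signals that pass the cheap check, so the savings grow with the fraction of off-target reads. The threshold trades sensitivity against filtering rate: a stricter `min_frac` discards more off-target reads but risks dropping noisy on-target ones.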