The high-throughput short-reads RNA-seq protocols often produce paired-end reads, with the middle portion of the fragments being unsequenced. We explore if the full-length fragments can be computationally reconstructed from the sequenced two ends in the absence of the reference genome - a problem here we refer to as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads, and benefits downstream RNA-seq analysis such as transcript assembly, expression quantification, and splicing differential analysis. However, de novo bridging is a challenging and complicated task owing to alternative splicing, transcript noises, and sequencing errors. It remains unclear if the data provides sufficient information for accurate bridging, let alone efficient algorithms that determine the true bridges. Methods have been proposed to bridge paired-end reads in the presence of reference genome (called reference-based bridging), but the algorithms are far away from scaling for de novo bridging as the underlying compacted de Bruijn graph(cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra's algorithm from scratch for all vertices for further speeding up. These innovative techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.
翻译:高通量短读长RNA-seq协议通常产生成对末端读取数据,其中片段中间部分未测序。本文探讨了在缺少参考基因组的情况下是否可以从已测两端计算重建完整长度的片段,这是一个称为原创桥接的问题。解决这个问题可以提供更长、更有信息量的RNA-seq读取,有利于下游RNA-seq分析如转录组装、表达量定量和剪接差异分析。然而,由于另类剪接、转录本噪声和测序误差,原创桥接是一个具有挑战性和复杂性的任务。目前尚不清楚数据是否提供足够的信息进行准确的桥接,甚至不知道是否存在高效的算法来确定真正的桥接。已经提出了一些方法来桥接参考基因组存在的配对末端读取数据(称为基于参考基因组的桥接),但这些算法还远远不能扩展到原创桥接,因为后者任务中使用的压缩de Bruijn图(cdBG)通常包含数百万个顶点和边。我们设计了一种新的截断Dijkstra算法来解决这个问题,并提出了一种利用最短路径树进行重用的新算法,以避免从头开始运行所有顶点的截断Dijkstra算法以进一步加速。这些创新技术产生了可扩展的算法,可以在包含数百万个顶点的cdBG中桥接所有配对末端读取数据。我们的实验表明,配对的RNA-seq读取可以被准确桥接到相当大的程度。所得到的工具可在https://github.com/Shao-Group/rnabridge-denovo上免费使用。