Predicting the consensus structure of a set of aligned RNA homologs is a convenient way to find conserved structures in an RNA genome, with applications in SARS-CoV-2 diagnostics and therapeutics. However, the state-of-the-art algorithm for this task, RNAalifold, is prohibitively slow for long sequences because its runtime scales cubically with sequence length, and it is slower still when analyzing many such sequences because its runtime scales superlinearly with the number of homologs; it takes 4 days on 200 SARS-CoV variants. We present LinearAlifold, an efficient algorithm for folding aligned RNA homologs that scales linearly with both the sequence length and the number of sequences. It builds on our recent work LinearFold, which folds a single RNA in linear time. LinearAlifold is orders of magnitude faster than RNAalifold (e.g., 0.5 hours on the above 200 sequences, or a 316-fold speedup) and achieves comparable accuracy when evaluated against a database of known structures. More interestingly, LinearAlifold's predictions on SARS-CoV-2 correlate well with experimentally determined structures, outperforming RNAalifold. Finally, LinearAlifold supports three modes: minimum free energy (MFE), partition function, and stochastic sampling. Each takes under an hour for hundreds of SARS-CoV variants, whereas only the MFE mode of RNAalifold works on these inputs, taking days or weeks.