SARS-CoV-2 is an RNA virus of the upper respiratory system that had caused over 3 million deaths and infected over 150 million people worldwide as of May 2021. With thousands of strains sequenced to date, SARS-CoV-2 mutations pose significant challenges to scientists in keeping pace with vaccine development and public health measures. An efficient method of identifying the divergence of patient-derived lab samples would therefore greatly aid the documentation of SARS-CoV-2 genomics. In this study, we propose a neural network model that leverages recurrent and convolutional units to take in amino acid sequences of spike proteins directly and classify them into their corresponding clades. We also compare our model's performance with Bidirectional Encoder Representations from Transformers (BERT) pre-trained on a protein database. Our approach has the potential to provide a more computationally efficient alternative to current homology-based intra-species differentiation.
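To make the idea of feeding amino acid sequences through convolutional and recurrent units concrete, the following is a minimal sketch of a forward pass only. It is not the architecture used in this study: the one-hot encoding, the layer sizes, the single Elman-style recurrent unit, the random untrained weights, the number of clades (4), and all function names are assumptions chosen for illustration, and the input string is a toy peptide fragment.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"       # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN = len(AMINO_ACIDS)                 # slot for ambiguous residues such as X
VOCAB = len(AMINO_ACIDS) + 1

def one_hot(seq):
    """Turn an amino acid string into a list of one-hot vectors."""
    rows = []
    for aa in seq.upper():
        vec = [0.0] * VOCAB
        vec[AA_INDEX.get(aa, UNKNOWN)] = 1.0
        rows.append(vec)
    return rows

def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def conv1d(x, kernels, width):
    """Valid 1D convolution with ReLU over windows of `width` adjacent residues."""
    out = []
    for t in range(len(x) - width + 1):
        window = [v for row in x[t:t + width] for v in row]  # flatten the window
        out.append([max(0.0, sum(w * v for w, v in zip(k, window)))
                    for k in kernels])
    return out

def rnn_last_state(x, w_in, w_rec):
    """Simple (Elman) recurrent unit; returns the final hidden state."""
    h = [0.0] * len(w_rec)
    for step in x:
        h = [math.tanh(sum(wi * s for wi, s in zip(w_in[j], step)) +
                       sum(wr * hp for wr, hp in zip(w_rec[j], h)))
             for j in range(len(h))]
    return h

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def init_params(n_filters=8, width=3, hidden=6, n_clades=4, seed=0):
    """Hypothetical sizes; n_clades=4 is an arbitrary illustrative choice."""
    rng = random.Random(seed)
    return {
        "width": width,
        "conv": rand_matrix(n_filters, width * VOCAB, rng),
        "w_in": rand_matrix(hidden, n_filters, rng),
        "w_rec": rand_matrix(hidden, hidden, rng),
        "w_out": rand_matrix(n_clades, hidden, rng),
    }

def classify(seq, params):
    """Sequence -> convolutional features -> recurrent summary -> clade probabilities."""
    feats = conv1d(one_hot(seq), params["conv"], params["width"])
    h = rnn_last_state(feats, params["w_in"], params["w_rec"])
    logits = [sum(w * v for w, v in zip(row, h)) for row in params["w_out"]]
    return softmax(logits)

params = init_params()
probs = classify("MFVFLVLLPLVSSQCVNLT", params)  # toy fragment, untrained weights
```

Here `probs` holds one probability per hypothetical clade. Because the weights are random, the values carry no biological meaning; the sketch only shows how a raw residue string can flow through convolutional and recurrent stages without an alignment step.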