SARS-CoV-2 is the coronavirus that causes COVID-19 in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. It is well known that the major SARS-CoV-2 lineages are characterized by mutations that occur predominantly in the spike protein. Understanding the structure of the spike protein and how it can be perturbed is vital for determining whether a lineage is of concern, which in turn is crucial to identifying and controlling current outbreaks and preventing future pandemics. Machine learning (ML) methods are a viable approach to this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. However, such ML methods require fixed-length numerical feature vectors in Euclidean space to be applicable. Moreover, Euclidean space is not considered the best choice for classification and clustering tasks on biological sequences. For this purpose, we design a method that converts protein (spike) sequences into a sequence similarity network (SSN). The SSN can then serve as input to classical algorithms from the graph mining domain for typical tasks such as classification and clustering, in order to understand the data. We show that the proposed alignment-free method outperforms the current state-of-the-art method in terms of clustering results. Similarly, we achieve higher classification accuracy using the well-known Node2Vec-based embedding than with other baseline embedding approaches.
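To make the idea of an alignment-free sequence similarity network concrete, the following is a minimal sketch and not the construction used in this work. It assumes that similarity is measured as the Jaccard overlap of k-mer sets and that an edge is drawn when that overlap reaches a fixed threshold; both choices, the function names (`kmers`, `jaccard`, `build_ssn`, `connected_components`), the parameter values, and the toy sequences are hypothetical and chosen only for illustration. Connected components stand in for a graph clustering algorithm.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers of a sequence (no alignment needed)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_ssn(seqs, k=3, threshold=0.5):
    """Build an undirected SSN as an adjacency dict {name: set of neighbours}.

    Assumption: two sequences are linked when the Jaccard similarity of
    their k-mer sets is at least `threshold`.
    """
    profiles = {name: kmers(s, k) for name, s in seqs.items()}
    graph = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if jaccard(profiles[u], profiles[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def connected_components(graph):
    """Group nodes into connected components with an iterative depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components


# Made-up toy fragments of unequal length, not real spike sequences.
toy_seqs = {
    "a": "MFVFLVLLPLVSSQ",
    "b": "MFVFLVLLPLVSSR",
    "c": "NITNLCPFGEVFNA",
    "d": "NITNLCPFDEVFNAT",
}
toy_graph = build_ssn(toy_seqs, k=3, threshold=0.5)
toy_clusters = connected_components(toy_graph)
```

Because the k-mer sets are computed per sequence, the inputs do not need to be aligned or of equal length. In a fuller pipeline the resulting graph could be handed to a node embedding method such as Node2Vec to obtain fixed-length vectors for classification; that step is omitted here.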