Dive into Machine Learning Algorithms for Influenza Virus Host Prediction with Hemagglutinin Sequences

Influenza viruses mutate rapidly and can pose a threat to public health, especially to vulnerable groups. Throughout history, influenza A viruses have caused pandemics across different species. Identifying the origin of a virus is important for preventing the spread of an outbreak. Recently, there has been increasing interest in using machine learning algorithms to provide fast and accurate predictions for viral sequences. In this study, real test data sets and a variety of evaluation metrics were used to evaluate machine learning algorithms at different taxonomic levels. As hemagglutinin is the major protein in the immune response, only hemagglutinin sequences were used, represented by position-specific scoring matrices and word embeddings. The results suggest that the 5-grams-transformer neural network is the most effective algorithm for predicting viral sequence origins, with approximately 99.54% AUCPR, 98.01% F1 score and 96.60% MCC at a higher classification level, and approximately 94.74% AUCPR, 87.41% F1 score and 80.79% MCC at a lower classification level.
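
To make the terms in the abstract concrete, the following is a minimal sketch of two ingredients it mentions: splitting a protein sequence into 5-grams, and computing the F1 score and MCC from predicted host labels. It is an illustration under stated assumptions, not the study's implementation. The function names, the toy sequence and labels, the use of overlapping 5-grams with stride 1, and the restriction to a binary (one host versus the rest) setting are all assumptions made here.

```python
from math import sqrt


def kmers(sequence: str, k: int = 5) -> list[str]:
    """Split a sequence into overlapping k-grams (stride 1).

    Assumption: overlapping windows; the abstract does not say whether
    the study used overlapping or non-overlapping 5-grams.
    Returns an empty list if the sequence is shorter than k.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when the denominator is 0."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy amino-acid fragment and toy labels, invented for illustration only.
    print(kmers("MKTIIALSY"))
    y_true = ["human", "human", "avian", "avian", "human"]
    y_pred = ["human", "avian", "avian", "avian", "human"]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="human")
    print(f1_score(tp, fp, fn), mcc(tp, fp, fn, tn))
```

In a multi-class host prediction task, per-class scores like these would typically be averaged across classes; how the study aggregated them is not stated in the abstract.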
