COVID-19 genetic analysis is critical for determining virus type and virus variant and for evaluating vaccines. In this paper, SARS-CoV-2 RNA sequence analysis with respect to region or territory is investigated. A unified sequence-SVM framework, covering genetic lengths from short to long as well as mixed bases, is developed: each SARS-CoV-2 RNA sequence is projected into spaces of different dimensions and then scored according to the output probabilities of pre-trained SVM models, in order to explore the territory or origin information of SARS-CoV-2. Different sample-size ratios between the training set and the test set are also discussed in the data analysis. Two SARS-CoV-2 RNA classification tasks are constructed from the GISAID database: one covers mainland China, Hong Kong, and Taiwan, and the other is a 6-class classification task (Africa, Asia, Europe, North America, South America \& Central America, Oceania) spanning the 7 continents. For the 3-class classification of China, the Top-1 accuracy reaches 82.45\% (train 60\%, test 40\%); for the 2-class classification of China, the Top-1 accuracy reaches 97.35\% (train 80\%, test 20\%); for the 6-class world task, with a training-to-test ratio of 20\% : 80\%, the Top-1 accuracy is 30.30\%. Some Top-N results are also reported.
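As a rough illustration of the two ideas named in the abstract (projecting a sequence into spaces of different dimensions, and ranking classes by per-class scores to obtain Top-N accuracy), the following minimal Python sketch may help. It rests on assumptions that the abstract does not confirm: the projection is taken to be a k-mer frequency vector (dimension 4^k), and the per-class scores are hypothetical placeholder values standing in for the output probabilities of pre-trained SVM models. The function names `kmer_vector` and `top_n_accuracy` are illustrative only.

```python
from collections import Counter
from itertools import product


def kmer_vector(seq, k, alphabet="ACGU"):
    """Project a sequence onto a len(alphabet)**k vector of k-mer frequencies.

    ASSUMPTION: k-mer counting is one plausible projection, not necessarily
    the paper's. Windows containing letters outside `alphabet` are ignored.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def top_n_accuracy(scores, labels, n):
    """Fraction of samples whose true label is among the n best-scoring classes.

    `scores` is a list of {class_name: score} dicts, one per sample.
    """
    hits = 0
    for class_scores, true_label in zip(scores, labels):
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        hits += true_label in ranked[:n]
    return hits / len(labels)


# Short and longer k give spaces of different dimension for the same sequence.
short_features = kmer_vector("ACGUACGU", k=1)   # 4 dimensions
long_features = kmer_vector("ACGUACGU", k=3)    # 64 dimensions

# HYPOTHETICAL per-class scores for two samples (not results from the paper).
example_scores = [
    {"Asia": 0.50, "Europe": 0.30, "Africa": 0.20},
    {"Asia": 0.40, "Europe": 0.35, "Africa": 0.25},
]
example_labels = ["Asia", "Europe"]
top1 = top_n_accuracy(example_scores, example_labels, n=1)
top2 = top_n_accuracy(example_scores, example_labels, n=2)
```

In practice, the score dictionaries would come from a probabilistic classifier trained on the projected features. Sequences stored with DNA letters can be handled by passing `alphabet="ACGT"`.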