Machine learning (ML) models such as SVM, when used for tasks like classification and clustering of sequences, require a definition of distance or similarity between pairs of sequences. Several methods have been proposed to compute this similarity, including an exact approach that counts the number of matches between $k$-mers (sub-sequences of length $k$) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, their high computational cost limits them to a small number of sequences. The approximate algorithms have been shown to be more scalable and to perform comparably to (and sometimes better than) the exact methods. They are, however, designed in a "general" way to handle different types of sequences (e.g., music, protein, etc.). General applicability is a desirable property of an algorithm, but it is not the best fit in every scenario. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that deals specifically with the coronavirus. To this end, we propose a series of ways to improve the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizer computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results for different classification and clustering algorithms and evaluate them using multiple evaluation metrics. Using two datasets, we show that our proposed method improves the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.
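To make the terms $k$-mer and minimizer concrete, the following is a minimal Python sketch, not the method proposed here. It assumes one common definition of a minimizer (the lexicographically smallest $m$-mer of a $k$-mer, taken over the $k$-mer and its reversal, with $m < k$) and uses a plain dot product of count vectors as a stand-in for the exact $k$-mer matching kernel. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping sub-sequences of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, m):
    """Smallest m-mer (lexicographic order) of a k-mer.

    Assumption: candidates are taken from the k-mer read forwards and
    backwards; other definitions of a minimizer exist.
    """
    candidates = kmers(kmer, m) + kmers(kmer[::-1], m)
    return min(candidates)


def kmer_spectrum(seq, k):
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(seq, k))


def minimizer_spectrum(seq, k, m):
    """Count the minimizers of all k-mers in the sequence."""
    return Counter(minimizer(km, m) for km in kmers(seq, k))


def spectrum_kernel(a, b):
    """Exact similarity as a dot product of two count vectors (assumed form)."""
    return sum(a[x] * b[x] for x in a.keys() & b.keys())


if __name__ == "__main__":
    s1, s2 = "MFVFL", "MFVFV"  # toy sequences, not real spike proteins
    print(kmers(s1, 3))
    print(minimizer_spectrum(s1, k=3, m=2))
    print(spectrum_kernel(kmer_spectrum(s1, 3), kmer_spectrum(s2, 3)))
```

Replacing each $k$-mer with its minimizer yields shorter tokens drawn from a smaller vocabulary, which is the sense in which minimizers act as a preprocessing step before a kernel is computed.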