Applying machine learning to biological sequences (DNA, RNA, and protein) has enormous potential to advance human health, environmental sustainability, and fundamental biological understanding. However, many existing machine learning methods are ineffective or unreliable in this problem domain. We study these challenges theoretically, through the lens of kernels. Methods based on kernels are ubiquitous: they are used to predict molecular phenotypes, design novel proteins, compare sequence distributions, and more. Many methods that do not use kernels explicitly still rely on them implicitly, including a wide variety of both deep learning and physics-based techniques. While kernels for other types of data are well studied theoretically, the structure of biological sequence space (discrete, variable-length sequences), as well as biological notions of sequence similarity, presents unique mathematical challenges. We formally analyze how well kernels for biological sequences can approximate arbitrary functions on sequence space and how well they can distinguish different sequence distributions. In particular, we establish conditions under which biological sequence kernels are universal, are characteristic, and metrize the space of distributions. We show that a large number of existing kernel-based machine learning methods for biological sequences fail to meet our conditions and can, as a consequence, fail severely. We develop straightforward and computationally tractable ways of modifying existing kernels to satisfy our conditions, imbuing them with strong guarantees on accuracy and reliability. Our proof techniques build on and extend the theory of kernels with discrete masses. We illustrate our theoretical results in simulation and on real biological data sets.
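To make the notion of a kernel failing to distinguish sequence distributions concrete, the following is a minimal sketch, not the authors' code. It assumes a fixed-k spectrum (k-mer count) kernel, a standard construction chosen here only for illustration, and all function names (`kmer_counts`, `spectrum_kernel`, `mmd_squared`) are hypothetical. It computes the squared maximum mean discrepancy (MMD) between two point-mass distributions on different sequences that happen to share the same 2-mer counts.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Counter returns 0 for k-mers that are absent from y.
    return sum(cx[m] * cy[m] for m in cx)


def mmd_squared(xs, ys, kernel):
    """Squared MMD between the empirical distributions of xs and ys."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy


# Two different sequences with identical 2-mer counts: {AA, AC, CA}.
p, q = ["AACA"], ["ACAA"]

# With k=2 the kernel cannot tell the two distributions apart.
mmd_k2 = mmd_squared(p, q, lambda a, b: spectrum_kernel(a, b, 2))

# With k=3 the same pair is separated, but other pairs would collide.
mmd_k3 = mmd_squared(p, q, lambda a, b: spectrum_kernel(a, b, 3))

print(mmd_k2, mmd_k3)
```

Under these assumptions the squared MMD is zero for k=2 even though the two distributions differ, so this particular kernel is not characteristic. Raising k separates this pair but does not remove the problem in general, since any fixed k admits distinct sequences with equal k-mer counts.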