Mixture models for spherical data with applications to protein bioinformatics

Finite mixture models are fitted to spherical data. Kent distributions are used for the components of the mixture because they allow considerable flexibility. Previous work on such mixtures has used an approximate maximum likelihood estimator for the parameters of a single component. However, the approximation causes problems when the EM algorithm is used to estimate the parameters of a mixture model, so the exact maximum likelihood estimator is used here for the individual components. This paper is motivated by a challenging prize problem in structural bioinformatics: how proteins fold. Hydrogen bonds are known to play a key role in protein folding. We explore hydrogen bond geometry using a data set describing bonds between two amino acids in proteins. We propose a coordinate system for representing this geometry in which each bond is a point on a sphere. We then fit mixtures of Kent distributions to different subsets of the hydrogen bond data to gain insight into how the secondary structure elements bond together, since the distribution of hydrogen bonds depends on which secondary structure elements are involved.
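
For readers unfamiliar with the model, the Kent distribution on the unit sphere has density proportional to exp{kappa * (g1 . x) + beta * [(g2 . x)^2 - (g3 . x)^2]}, where g1, g2, g3 are orthonormal axes, kappa is the concentration and beta the ovalness (usually with 0 <= 2 * beta < kappa). The following is a minimal sketch, not the estimator used in the paper: it approximates the normalizing constant by brute-force numerical integration and computes the E-step responsibilities of an EM iteration for one data point. All function names and the component tuple layout are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors given as tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))


def kent_unnormalized(x, g1, g2, g3, kappa, beta):
    """Unnormalized Kent density at unit vector x with orthonormal axes g1, g2, g3."""
    return math.exp(kappa * dot(g1, x) + beta * (dot(g2, x) ** 2 - dot(g3, x) ** 2))


def kent_normalizer(kappa, beta, n=100):
    """Midpoint-rule approximation of the integral of the unnormalized density.

    The constant depends only on kappa and beta, so canonical axes are used.
    This is a numerical stand-in, not an exact or series-based formula.
    """
    g1, g2, g3 = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
    d_theta = math.pi / n
    d_phi = math.pi / n  # 2*n steps cover the full 2*pi range of phi
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * d_theta
        st, ct = math.sin(theta), math.cos(theta)
        for j in range(2 * n):
            phi = (j + 0.5) * d_phi
            x = (st * math.cos(phi), st * math.sin(phi), ct)
            # st is the Jacobian of spherical coordinates
            total += kent_unnormalized(x, g1, g2, g3, kappa, beta) * st
    return total * d_theta * d_phi


def responsibilities(x, components):
    """E-step for one point: posterior probability of each mixture component.

    components is a list of (weight, g1, g2, g3, kappa, beta) tuples (assumed layout).
    """
    weighted = [
        w * kent_unnormalized(x, g1, g2, g3, k, b) / kent_normalizer(k, b)
        for (w, g1, g2, g3, k, b) in components
    ]
    total = sum(weighted)
    return [v / total for v in weighted]


if __name__ == "__main__":
    north = (0.5, (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 5.0, 1.0)
    south = (0.5, (0.0, 0.0, -1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 5.0, 1.0)
    print(responsibilities((0.0, 0.0, 1.0), [north, south]))
```

In a full EM fit the normalizer would be computed once per component per iteration rather than once per data point, and the M-step would update the weights, axes, kappa and beta; those parts are omitted here.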
