Explainable and interpretable unsupervised machine learning helps to understand the underlying structure of data. We introduce an ensemble analysis of machine learning models to consolidate their interpretation. Its application shows that restricted Boltzmann machines consistently compress into a few bits the information stored in a sequence of five amino acids at the start or end of $\alpha$-helices or $\beta$-sheets. The weights learned by the machines reveal unexpected properties of the amino acids and of the secondary structure of proteins: (i) His and Thr contribute negligibly to the amphiphilic pattern of $\alpha$-helices; (ii) there is a class of $\alpha$-helices particularly rich in Ala at their end; (iii) Pro most often occupies slots otherwise occupied by polar or charged amino acids, and its presence at the start of helices is relevant; (iv) Glu and especially Asp on one side, and Val, Leu, Ile, and Phe on the other, display the strongest tendency to mark amphiphilic patterns, i.e., extreme values of an "effective hydrophobicity", though they are not the most strongly (non-)hydrophobic amino acids.
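To make the compression step concrete, the following minimal sketch shows how a binary restricted Boltzmann machine could map a window of five amino acids onto a few hidden bits. It is an illustration under stated assumptions, not the model used in this work: the one-hot encoding, the choice of three hidden units, the function names, and the random untrained weights are all hypothetical placeholders.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
WINDOW = 5      # residues per sequence window, as stated in the text
N_HIDDEN = 3    # assumption: "a few bits" is taken here as three hidden units


def one_hot(window):
    """Encode a 5-residue window as a flat binary vector of length 5 * 20."""
    if len(window) != WINDOW:
        raise ValueError(f"expected {WINDOW} residues, got {len(window)}")
    v = np.zeros((WINDOW, len(AMINO_ACIDS)))
    for pos, residue in enumerate(window):
        v[pos, AMINO_ACIDS.index(residue)] = 1.0
    return v.ravel()


def hidden_probabilities(v, weights, hidden_bias):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij) for a binary RBM."""
    return 1.0 / (1.0 + np.exp(-(hidden_bias + v @ weights)))


def encode(window, weights, hidden_bias):
    """Compress a window into N_HIDDEN bits by thresholding P(h | v) at 0.5."""
    p = hidden_probabilities(one_hot(window), weights, hidden_bias)
    return (p > 0.5).astype(int)


# Placeholder parameters: random and untrained, for illustration only.
# A trained machine would learn these from windows at helix or sheet boundaries.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(WINDOW * len(AMINO_ACIDS), N_HIDDEN))
hidden_bias = np.zeros(N_HIDDEN)

code = encode("AEELL", weights, hidden_bias)
print(code)  # a length-3 vector of 0/1 values
```

In a trained machine, each row block of `weights` associated with one position holds one weight per amino acid and hidden unit, and it is this kind of per-residue weight that the abstract refers to when it speaks of an "effective hydrophobicity".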