Boltzmann machines (BMs) are widely used as generative models. For example, pairwise Potts models (PMs), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino-acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings in the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful in predicting residue contacts in the three-dimensional structure, predicting mutational effects, and generating new functional sequences. However, the resulting PM suffers from substantial overfitting: many couplings are small, noisy, and hard to interpret, and the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, based on a controlled iterative decimation of the least statistically significant couplings, which are identified by an information-based criterion that selects couplings that are either weak or statistically unsupported. For several protein families, our procedure allows one to remove more than $90\%$ of the PM couplings while preserving the predictive and generative properties of the original dense PM. The resulting model is far from criticality and hence more robust to noise.
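As a rough illustration of what one decimation step could look like, the sketch below scores each coupling and zeroes out the lowest-scoring fraction. The abstract does not spell out the criterion, so the score used here is an assumption: the symmetric Kullback-Leibler divergence between a model and the same model with a single coupling $J$ set to zero, which works out to $J p - J p\, e^{-J} / (1 + p\,(e^{-J} - 1))$, where $p$ is the model probability of the state pair that $J$ acts on. The function names, the array layout, and the `frac` parameter are hypothetical, and the re-training of the remaining parameters between decimation steps is omitted.

```python
import numpy as np

def sym_kl_score(J, p):
    """Assumed score: symmetric KL divergence between a model and the same
    model with one coupling J set to zero. p is the model probability of
    the state pair that J acts on."""
    w = np.exp(-J)
    return J * p - J * p * w / (1.0 + p * (w - 1.0))

def decimation_step(J, p, mask, frac=0.1):
    """Zero out the `frac` lowest-scoring couplings among those still active.

    J, p and mask are arrays of identical shape; mask is True where a
    coupling is still active. Returns updated copies (J_new, mask_new).
    """
    active = np.flatnonzero(mask)          # flat indices of active couplings
    n_remove = int(frac * active.size)
    J_new, mask_new = J.copy(), mask.copy()
    if n_remove == 0:
        return J_new, mask_new
    scores = sym_kl_score(J.ravel()[active], p.ravel()[active])
    drop = active[np.argsort(scores)[:n_remove]]
    J_new.flat[drop] = 0.0                 # remove the selected couplings
    mask_new.flat[drop] = False
    return J_new, mask_new
```

Under this assumed score, a coupling ranks low either when $J$ is close to zero (weak) or when $p$ is close to zero (little statistical support), which is consistent with the two cases named in the abstract.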