Optimal regularizations for data generation with probabilistic graphical models

Understanding the role of regularization is a central question in statistical inference. Empirically, well-chosen regularization schemes often dramatically improve the quality of the inferred models by avoiding overfitting of the training data. We consider here the particular case of L2 and L1 regularizations in the Maximum A Posteriori (MAP) inference of generative pairwise graphical models. Based on analytical calculations on multivariate Gaussian distributions and numerical experiments on Gaussian and Potts models, we study the likelihoods of the training set, the test set, and the 'generated data' set (produced with the inferred models) as functions of the regularization strength. We show in particular that, at the regularization strength where the test likelihood is maximal, the test likelihood and the 'generated' likelihood, which quantifies the quality of the generated samples, have remarkably close values. The optimal value of the regularization strength is found to be approximately equal to the inverse of the sum of the squared couplings incoming on the sites of the underlying interaction network. Our results seem largely independent of the structure of the true underlying interactions that generated the data and of the regularization scheme considered, and they remain valid when small fluctuations of the posterior distribution around the MAP estimator are taken into account. Connections with empirical works on protein models learned from homologous sequences are discussed.
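To make the Gaussian setting concrete, the following is a minimal sketch, not the authors' code, of how the three likelihoods could be scanned against the regularization strength. It rests on several assumptions that the abstract does not state: the data are zero-mean, the L2 penalty acts on every entry of the precision (coupling) matrix J with the convention 0.5*(log det J - tr(C J)) - (gamma/4)*||J||_F^2, and the 'generated' likelihood is taken to be the expected log-probability, under the true model, of samples drawn from the inferred model. With this penalty the MAP estimator shares the eigenvectors of the empirical covariance C, and each eigenvalue j solves gamma*j^2 + c*j - 1 = 0. The names `map_precision_l2`, `avg_loglik`, `J_true`, and the sizes `n`, `p` are all illustrative.

```python
import numpy as np

def map_precision_l2(C, gamma):
    """MAP precision matrix for the assumed objective
    0.5*(logdet J - tr(C J)) - (gamma/4)*||J||_F^2 (penalty on all entries)."""
    c, V = np.linalg.eigh(C)
    c = np.clip(c, 0.0, None)  # guard against tiny negative eigenvalues
    # Positive root of gamma*j^2 + c*j - 1 = 0, written in a numerically stable form.
    j = 2.0 / (c + np.sqrt(c**2 + 4.0 * gamma))
    return (V * j) @ V.T

def avg_loglik(J, C):
    """Average log-likelihood, under a zero-mean Gaussian with precision J,
    of data whose second-moment matrix is C."""
    n = J.shape[0]
    _, logdet = np.linalg.slogdet(J)
    return 0.5 * (logdet - np.trace(C @ J) - n * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
n, p = 20, 40  # number of variables, number of samples per data set (illustrative)

# Hypothetical "true" model: a positive-definite precision matrix.
A = rng.normal(size=(n, n)) / np.sqrt(n)
J_true = np.eye(n) + 0.5 * A @ A.T
Sigma_true = np.linalg.inv(J_true)

X_train = rng.multivariate_normal(np.zeros(n), Sigma_true, size=p)
X_test = rng.multivariate_normal(np.zeros(n), Sigma_true, size=p)
C_train = X_train.T @ X_train / p
C_test = X_test.T @ X_test / p

gammas = np.logspace(-3, 1, 9)
results = []
for gamma in gammas:
    J = map_precision_l2(C_train, gamma)
    l_train = avg_loglik(J, C_train)
    l_test = avg_loglik(J, C_test)
    # Assumed definition: samples from the inferred model (covariance inv(J)),
    # scored by the true model.
    l_gen = avg_loglik(J_true, np.linalg.inv(J))
    results.append((gamma, l_train, l_test, l_gen))
results = np.array(results)

for gamma, l_train, l_test, l_gen in results:
    print(f"gamma={gamma:8.3g}  train={l_train:8.3f}  test={l_test:8.3f}  gen={l_gen:8.3f}")
```

Printing the rows side by side shows how the training likelihood can only decrease as gamma grows, while the test and 'generated' columns can be compared near the value of gamma that maximizes the test likelihood. Restricting the penalty to off-diagonal couplings, or switching to L1, would no longer admit this closed form and would need a numerical optimizer.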
