The second expectation on the right-hand side of the equation can be estimated by Monte Carlo using the reparameterization trick for q_ϕ(z|x) that is standard in VAEs [2], and its logarithm can be computed with the logsumexp trick to keep the estimate numerically stable. In contrast, the DAE objective mentioned earlier, E_{p*(x) q_ϕ(z'|x)}[log p_θ(x|z')], is an upper bound of this log-likelihood (note that the objective is to be maximized), so maximizing it gives no guarantee on the quality of the fit. In particular, it drives q_ϕ(z|x) toward a Dirac distribution concentrated on the maximizer of p_θ(x|z) over z (i.e., mode collapse), which destroys determinacy. This explains its failure in Figure 1. CyGen's final training objective is the sum of this maximum-likelihood objective and the compatibility objective described above.

As for generating data: because the two conditional distributions do not define a start-to-finish generative process, ancestral sampling is not applicable, but Markov chain Monte Carlo (MCMC) methods still are. They only require an unnormalized density of the target distribution, which CyGen readily provides: p_{θ,ϕ}(x) ∝ p_θ(x|z) / q_ϕ(z|x). The researchers chose dynamics-based MCMC methods, such as stochastic gradient Langevin dynamics (SGLD) [26]; see the original paper for details.
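To make the training objective concrete, here is a minimal PyTorch-style sketch of the Monte Carlo + logsumexp estimator. The `encoder` and `decoder` callables are hypothetical interfaces assumed only for illustration (each returns a torch.distributions object with event dimensions already reduced, e.g. via Independent), and the likelihood is written as log p_{θ,ϕ}(x) = -log E_{q_ϕ(z'|x)}[1 / p_θ(x|z')], one form consistent with the relations quoted in this article; the paper's exact formulation may differ.

```python
import math
import torch

def cygen_log_likelihood(x, encoder, decoder, K=32):
    """Monte Carlo estimate of log p_{theta,phi}(x) for a batch x.

    Hypothetical interfaces (assumptions of this sketch):
      encoder(x) -> torch.distributions object for q_phi(z|x), supporting rsample()
                    (the reparameterization trick used in VAEs [2])
      decoder(z) -> torch.distributions object for p_theta(x|z), whose log_prob(x)
                    returns one value per example (e.g. wrapped in Independent)
    """
    q_z_given_x = encoder(x)                    # q_phi(z|x)
    z = q_z_given_x.rsample((K,))               # (K, batch, z_dim): reparameterized samples
    log_px_given_z = decoder(z).log_prob(x)     # (K, batch): log p_theta(x | z_k)
    # log E_q[1 / p_theta(x|z)] ~= logsumexp_k(-log p_theta(x|z_k)) - log K
    log_inv_lik = torch.logsumexp(-log_px_given_z, dim=0) - math.log(K)
    return -log_inv_lik                         # per-example estimate of log p_{theta,phi}(x)

# Training would maximize this estimate averaged over the data (equivalently, minimize its
# negative) together with the compatibility loss, as described in the text.
```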
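Along the same lines, here is a minimal sketch of data generation with SGLD [26], using the unnormalized density p_{θ,ϕ}(x) ∝ p_θ(x|z) / q_ϕ(z|x) quoted above (which holds for a fixed z under compatibility). The step size, number of steps, and the heuristic of periodically refreshing z from q_ϕ(z|x) are illustrative choices for this sketch, not the settings used in the paper; the same hypothetical `encoder` / `decoder` interfaces are assumed.

```python
import torch

def sgld_generate(x_init, encoder, decoder, n_steps=200, step_size=1e-3, refresh_every=50):
    """Approximate samples from p_{theta,phi}(x) via stochastic gradient Langevin dynamics.

    Score used:  grad_x [ log p_theta(x|z) - log q_phi(z|x) ]  for a fixed (periodically
    refreshed) z, which follows from  p_{theta,phi}(x) ∝ p_theta(x|z) / q_phi(z|x).
    """
    x = x_init.clone()
    z = None
    for t in range(n_steps):
        if t % refresh_every == 0:
            z = encoder(x.detach()).sample()             # heuristic refresh of z (assumption)
        x = x.detach().requires_grad_(True)
        # unnormalized log-density of x (up to a constant in x)
        log_unnorm = decoder(z).log_prob(x).sum() - encoder(x).log_prob(z).sum()
        grad = torch.autograd.grad(log_unnorm, x)[0]     # score of the unnormalized density
        # Langevin update: gradient step plus Gaussian noise
        x = x + 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(x)
    return x.detach()
```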
Experimental Results

In addition to the results on the synthetic dataset in Figure 1, the researchers also ran experiments on the real image datasets MNIST and SVHN. For a fair comparison, all methods use the same architectures for the conditional-distribution models. Because the models are stochastic (as required for CyGen's determinacy), BiGAN training was highly unstable and did not produce reasonable results, so it is not shown. The results in Figure 2 show that CyGen generates sharp and diverse samples, and that the classifier trained on the data representations it extracts achieves the highest accuracy; these respectively reflect CyGen's advantages of avoiding manifold mismatch and posterior collapse. In contrast, DAE, lacking determinacy, generates poorly, while VAE has the lowest accuracy, indicating a pronounced posterior-collapse problem. The paper also presents further analyses showing that the compatibility loss is necessary, that generating data with SGLD works better than with Gibbs sampling, and that, even with the prior distribution discarded, human knowledge can still be brought into generative modeling through the conditional-distribution models.

Figure 2: Generation results of various generative models on the real image datasets MNIST and SVHN, together with the accuracy (%) of classifiers trained on the data representations they extract (bottom-right corner of each panel).

Conclusion and Outlook

This work establishes a unified theoretical framework for the question of whether two conditional distributions determine a joint distribution, including necessary-and-sufficient criteria or sufficient conditions for the existence and uniqueness of the joint distribution, that is, for the compatibility and determinacy of the two conditionals. Building on this theory, it proposes CyGen, a new paradigm for generative modeling that requires only the two conditional-distribution models and no specified prior distribution, together with algorithms for achieving compatibility and determinacy and for fitting and generating data. Experiments demonstrate the benefits CyGen gains from removing the need to specify a prior: better generation quality and more useful data representations.

This style of generative modeling can benefit many application domains, because in most scenarios a prior distribution is hard to know, whereas some knowledge about the conditional distributions is available, for example the invariance of image features to translation, rotation, and scaling. On the other hand, the theory established here can also bring new insights to other areas of machine learning, such as dual learning and self-supervised learning, and thereby inspire new analyses and algorithms.

References

[1] P. Billingsley, Probability and Measure. New Jersey: John Wiley & Sons, 2012.
[2] D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in Proceedings of the International Conference on Learning Representations, 2014.
[3] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic Backpropagation and Approximate Inference in Deep Generative Models," in International Conference on Machine Learning, 2014, pp. 1278–1286.
[4] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep Unsupervised Learning using Nonequilibrium Thermodynamics," in International Conference on Machine Learning, 2015.
[5] J. Ho, A. Jain, and P. Abbeel, "Denoising Diffusion Probabilistic Models," in Advances in Neural Information Processing Systems, 2020.
[6] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-Based Generative Modeling through Stochastic Differential Equations," in Proceedings of the International Conference on Learning Representations, 2021.
[7] B. C. Arnold and S. J. Press, "Compatible conditional distributions," Journal of the American Statistical Association, vol. 84, no. 405, pp. 152–156, 1989.
[8] B. C. Arnold, E. Castillo, and J. M. Sarabia, "Conditionally specified distributions: An introduction," Statistical Science, vol. 16, no. 3, pp. 249–265, 2001.
[9] B. C. Arnold, E. Castillo, and J. M. Sarabia, Conditionally Specified Distributions. 1992.
[10] P. Berti, E. Dreassi, and P. Rigo, "Compatibility results for conditional distributions," Journal of Multivariate Analysis, vol. 125, pp. 190–203, 2014.
[11] I. J. Goodfellow et al., "Generative Adversarial Nets," in Advances in Neural Information Processing Systems, 2014.
[12] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear Independent Components Estimation," in Workshop at the International Conference on Learning Representations, 2015.
[13] D. P. Kingma and P. Dhariwal, "Glow: Generative Flow with Invertible 1x1 Convolutions," in Advances in Neural Information Processing Systems, 2018.
[14] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, "Learning to Discover Cross-Domain Relations with Generative Adversarial Networks," in International Conference on Machine Learning, 2017.
[15] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks," in IEEE International Conference on Computer Vision (ICCV), 2017.
[16] Z. Yi, H. Zhang, P. Tan, and M. Gong, "DualGAN: Unsupervised Dual Learning for Image-to-Image Translation," in IEEE International Conference on Computer Vision (ICCV), 2017.
[17] Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T.-Y. Liu, "Dual Supervised Learning," in International Conference on Machine Learning, 2017.
[18] Y. Xia et al., "Dual Learning for Machine Translation," in Advances in Neural Information Processing Systems, 2016.
[19] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial Feature Learning," in Proceedings of the International Conference on Learning Representations, 2017.
[20] V. Dumoulin et al., "Adversarially Learned Inference," in Proceedings of the International Conference on Learning Representations, 2017.
[21] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders," in International Conference on Machine Learning, 2008.
[22] Y. Bengio, L. Yao, G. Alain, and P. Vincent, "Generalized Denoising Auto-Encoders as Generative Models," in Advances in Neural Information Processing Systems, 2013.
[23] Y. Bengio, É. Thibodeau-Laufer, G. Alain, and J. Yosinski, "Deep Generative Stochastic Networks Trainable by Backprop," in International Conference on Machine Learning, 2014.
[24] H. Shao, A. Kumar, and P. T. Fletcher, "The Riemannian Geometry of Deep Generative Models," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
[25] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling, "Sylvester Normalizing Flows for Variational Inference," in Conference on Uncertainty in Artificial Intelligence, 2018, pp. 393–402.
[26] M. Welling and Y. W. Teh, "Bayesian Learning via Stochastic Gradient Langevin Dynamics," in International Conference on Machine Learning, 2011.