We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: SAMI recovers ground-truth generative factors on our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite being trained exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, its explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that the implicit structural information in diffusion models can be made explicit and interpretable through a synergistic combination with a variational autoencoder.