In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. If the data concentrates around a manifold embedded in a high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, since this becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model gives us access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and thus the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning-based estimator of the data manifold dimension, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.
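The idea can be illustrated with a minimal sketch, which is not the implementation used in this work. It assumes a toy setting in which the score is known in closed form, namely Gaussian data supported on a k-dimensional linear subspace of R^d, so no trained diffusion model is needed. The helper names `make_toy_score` and `estimate_intrinsic_dim`, the sample count, and the largest-gap rule for separating large from small singular values are all assumptions of this sketch. Scores are evaluated at small perturbations of a point on the manifold, stacked into a matrix, and the number of dominant singular values is taken as the dimension of the normal space.

```python
import numpy as np


def make_toy_score(d=10, k=3, sigma=0.01, seed=0):
    """Closed-form score for a hypothetical toy problem (assumption of this sketch).

    Data is N(0, B B^T) on a k-dim subspace with orthonormal basis B (d x k).
    After adding N(0, sigma^2 I) noise the density is N(0, B B^T + sigma^2 I),
    whose score is -precision @ x.
    """
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal columns
    proj_tangent = basis @ basis.T
    proj_normal = np.eye(d) - proj_tangent
    # Inverse of (B B^T + sigma^2 I), split into normal and tangent parts.
    precision = proj_normal / sigma**2 + proj_tangent / (1.0 + sigma**2)

    def score_fn(x):
        # x has shape (n, d); precision is symmetric, so x @ precision works.
        return -x @ precision

    x0 = basis @ rng.standard_normal(k)  # a point lying on the manifold
    return score_fn, x0


def estimate_intrinsic_dim(score_fn, x0, sigma, n_samples=500, seed=0):
    """Estimate the intrinsic dimension from score vectors near x0."""
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    # Perturb x0 with noise at the (small) corruption level sigma.
    x = x0 + sigma * rng.standard_normal((n_samples, d))
    scores = score_fn(x)  # shape (n_samples, d)
    # Singular values are returned in descending order.
    s = np.linalg.svd(scores, compute_uv=False)
    # Heuristic (assumption): the largest drop separates normal directions
    # (large singular values) from tangent directions (small ones).
    normal_dim = int(np.argmax(s[:-1] - s[1:])) + 1
    return d - normal_dim


sigma = 0.01
score_fn, x0 = make_toy_score(d=10, k=3, sigma=sigma)
dim_hat = estimate_intrinsic_dim(score_fn, x0, sigma)
print(dim_hat)
```

In this toy setup the normal components of the score scale like 1/sigma while the tangent components stay of order one, so the estimate should recover k = 3. With a trained diffusion model, `score_fn` would be replaced by the network's score approximation at a small noise level, and the quality of the estimate would depend on how well that approximation holds.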