In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A trained diffusion model approximates the score function, that is, the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, because this is the direction in which the likelihood increases fastest. Therefore, for small levels of corruption, the diffusion model gives us access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the normal space, hence that of the tangent space, and thus the intrinsic dimension of the data manifold. Our method outperforms linear methods for dimensionality detection such as PPCA in controlled experiments.
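
The idea can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the implementation evaluated in this work: the data distribution is taken to be a standard Gaussian supported on a random k-dimensional linear subspace of R^d, so that the score of its noise-corrupted version is available in closed form and can stand in for a trained diffusion model. The function names (`make_subspace`, `smoothed_score`, `estimate_dimension`) and the rule of cutting the singular-value spectrum at its largest gap are illustrative choices. Scores are evaluated at points near a base point on the manifold; for small noise they lie almost entirely in the normal space, so the number of dominant singular values of the stacked score vectors estimates the normal dimension.

```python
import numpy as np


def make_subspace(d, k, rng):
    """Return a d x k matrix whose orthonormal columns span a random k-dim subspace."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q


def smoothed_score(x, basis, sigma):
    """Exact score of N(0, basis @ basis.T + sigma^2 I), evaluated row-wise on x.

    Assumption: this closed form replaces the score network of a trained
    diffusion model, which is what the method would use in practice.
    """
    tangent = x @ basis @ basis.T   # component inside the subspace
    normal = x - tangent            # component orthogonal to the subspace
    return -tangent / (1.0 + sigma**2) - normal / sigma**2


def estimate_dimension(score_fn, x0, sigma, n_samples, rng):
    """Estimate the intrinsic dimension at x0 from the spectrum of nearby scores."""
    d = x0.shape[0]
    # Perturb the base point with a small amount of noise.
    x = x0 + sigma * rng.standard_normal((n_samples, d))
    scores = score_fn(x)                               # shape (n_samples, d)
    s = np.linalg.svd(scores, compute_uv=False)        # sorted in descending order
    # Illustrative rule: the largest drop in the spectrum separates
    # normal directions (large values) from tangent ones (small values).
    gaps = s[:-1] - s[1:]
    normal_dim = int(np.argmax(gaps)) + 1
    return d - normal_dim, s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, sigma = 10, 3, 0.01
    basis = make_subspace(d, k, rng)
    x0 = basis @ rng.standard_normal(k)                # a point on the manifold
    k_hat, spectrum = estimate_dimension(
        lambda x: smoothed_score(x, basis, sigma), x0, sigma, 500, rng
    )
    print("estimated intrinsic dimension:", k_hat)
    print("singular values:", np.round(spectrum, 2))
```

In this linear setting a method such as PPCA would recover the dimension equally well; the sketch only shows the mechanics of reading the normal dimension off the score spectrum. The largest-gap rule assumes 0 < k < d and a noise level small enough that normal components dominate the scores.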