There is growing interest in statistical inference from data satisfying the so-called manifold hypothesis, which assumes that data points in a high-dimensional ambient space lie close to a submanifold of much lower dimension. In machine learning, generative modeling approaches based on encoder-decoder pairs have been successful in learning complicated high-dimensional distributions, such as those over images and text, by explicitly imposing a low-dimensional manifold structure. In this work, we introduce a new approach for estimating distributions on unknown submanifolds via mixtures of generative models. We show that conventional generative modeling approaches using a single encoder-decoder pair are generally unable to capture data distributions under the manifold hypothesis unless the underlying manifold admits a global parametrization; this issue can be resolved by using a collection of encoder-decoder pairs, each learning a different local patch of the manifold supporting the data. We develop a rigorous theoretical analysis showing that the proposed estimator attains the minimax-optimal rate of convergence for the implicit estimation of data distributions with manifold structure. Our experiments show that, by utilizing parameter sharing, the proposed method can significantly improve the performance of conventional autoencoder-based generative modeling approaches with minimal additional computational effort.
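The role of a global parametrization can be illustrated on the unit circle, which no single continuous one-to-one map from an open interval can cover, while two overlapping charts can. The following is a minimal sketch under stated assumptions, not the estimator proposed in this work: the two decoders are hand-written rather than learned, and the names `decoder_a`, `decoder_b`, and `sample_mixture` are hypothetical.

```python
import numpy as np


def decoder_a(z):
    # Hypothetical hand-written chart: maps a real latent z to an angle in
    # (-pi, pi), covering the circle except the point (-1, 0).
    theta = np.pi * np.tanh(z)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)


def decoder_b(z):
    # Second hypothetical chart: angle in (0, 2*pi), covering the circle
    # except the point (1, 0). Together the two charts cover every point.
    theta = np.pi + np.pi * np.tanh(z)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)


def sample_mixture(n, weights=(0.5, 0.5), seed=0):
    # Draw a latent code and a component label for each sample, then push
    # the latent through the decoder selected by the label.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    k = rng.choice(2, size=n, p=weights)
    x = np.where((k == 0)[:, None], decoder_a(z), decoder_b(z))
    return x, k


if __name__ == "__main__":
    x, k = sample_mixture(1000)
    # Every sample lies on the unit circle, whichever decoder produced it.
    print(np.allclose(np.linalg.norm(x, axis=1), 1.0))
```

Each decoder alone misses one point of the circle, so its samples thin out near that point; the mixture places mass everywhere because the point missed by one chart lies in the interior of the other.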