Density estimation is a fundamental problem in statistics, and estimating densities in high dimensions typically requires strong assumptions or complex deep learning architectures. An important application of density estimators is synthetic data generation, an area currently dominated by neural networks that often demand enormous training datasets and extensive tuning. We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints, as well as for generating realistic synthetic data. We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators, which generally rely on ill-chosen split criteria and do not scale well with data dimensionality. Experiments illustrate that our algorithm compares favorably to state-of-the-art deep learning generative models, achieving superior performance in a range of benchmark trials while executing about two orders of magnitude faster on average. Our method is implemented in easy-to-use $\texttt{R}$ and Python packages.