Establishing a low-dimensional representation of the data leads to efficient data learning strategies. In many cases, the reduced dimension needs to be explicitly stated and estimated from the data. We explore the estimation of dimension in finite samples as a constrained optimization problem, where the estimated dimension is a maximizer of a penalized profile likelihood criterion within the framework of a probabilistic principal components analysis. Unlike other penalized maximization problems that require an "optimal" penalty tuning parameter, we propose a data-averaging procedure whereby the estimated dimension emerges as the most favourable choice over a range of plausible penalty parameters. The proposed heuristic is compared to a large number of alternative criteria in simulations and in an application to gene expression data. Extensive simulation studies reveal that none of the methods uniformly dominates the others and highlight the importance of subject-specific knowledge in choosing statistical methods for dimension learning. Our application results also suggest that gene expression data have a higher intrinsic dimension than previously thought. Overall, our proposed heuristic strikes a good balance and is the method of choice when model assumptions deviate moderately.
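As an illustration of the general approach described above, the sketch below combines the standard PPCA profile log-likelihood (Tipping and Bishop's closed form, evaluated at the sample covariance eigenvalues) with a penalty on model size, then averages the choice over a grid of penalty parameters by majority vote. The penalty form `alpha * m(k)` with `m(k)` the PPCA parameter count, the grid of `alpha` values, and the voting rule are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def ppca_profile_loglik(eigvals, k, n):
    """Profile log-likelihood of a rank-k PPCA model (Tipping & Bishop),
    given the descending eigenvalues of the sample covariance matrix."""
    p = len(eigvals)
    sigma2 = eigvals[k:].mean()  # ML estimate of the residual variance
    return -0.5 * n * (np.sum(np.log(eigvals[:k]))
                       + (p - k) * np.log(sigma2)
                       + p * np.log(2 * np.pi) + p)

def estimate_dim(X, alphas=(0.5, 1.0, 2.0)):
    """Illustrative data-averaging heuristic: for each penalty parameter
    alpha, pick the k maximizing the penalized profile likelihood, then
    return the k favoured most often across the alpha grid."""
    n, p = X.shape
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 1e-12, None)  # guard against round-off
    ks = range(1, p)  # k = p leaves no residual variance to estimate
    votes = []
    for alpha in alphas:
        # m(k): free parameters of a rank-k PPCA model (loadings + noise)
        crit = [ppca_profile_loglik(eigvals, k, n)
                - alpha * (p * k - k * (k - 1) / 2 + 1) for k in ks]
        votes.append(ks[int(np.argmax(crit))])
    return max(set(votes), key=votes.count)  # most frequent choice
```

Because the profile likelihood is nondecreasing in k, the penalty alone controls parsimony; averaging the selected k over several penalty strengths avoids committing to a single "optimal" tuning parameter.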