Many statistical studies analyze observations organized as a matrix whose entries are count data. When these observations are assumed to follow a Poisson or a multinomial distribution, the quantity of interest is the intensity matrix (Poisson case) or the compositional matrix (multinomial case), which is assumed here to have a low-rank structure. In this setting, an estimator is proposed that minimizes the negative log-likelihood regularized by a nuclear norm penalty. This approach readily yields a low-rank matrix-valued estimator with positive entries, which belongs to the set of row-stochastic matrices in the multinomial case. Then, as the main contribution, a data-driven procedure is constructed to select the regularization parameter of these estimators by minimizing (approximately) unbiased estimates of the Kullback-Leibler (KL) risk in such models, which generalize Stein's unbiased risk estimation originally proposed for Gaussian data. Evaluating these quantities is a delicate problem, and novel methods are introduced to obtain accurate numerical approximations of these unbiased estimates. Simulated data are used to validate this way of selecting the regularization parameter for low-rank matrix estimation from count data. For data following a multinomial distribution, the performance of this approach is also compared to $K$-fold cross-validation. Examples from a survey study and from metagenomics illustrate the benefits of this methodology for real data analysis.
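To make the Poisson case concrete, the following is a minimal sketch of a nuclear-norm-penalized negative log-likelihood estimator. It is not the authors' implementation. It assumes a log-link parametrization $X = \exp(Z)$ (entrywise) with the penalty applied to $Z$, and it uses a generic proximal gradient scheme with backtracking; the function names `svt`, `poisson_nll`, and `fit_poisson_nuclear` are hypothetical, and the regularization parameter `lam` is fixed by hand rather than selected by the KL risk estimates described above.

```python
import numpy as np


def svt(Z, tau):
    """Singular value soft-thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def poisson_nll(Z, Y):
    """Poisson negative log-likelihood in Z = log(X), up to a constant in Y."""
    return np.sum(np.exp(Z) - Y * Z)


def fit_poisson_nuclear(Y, lam, n_iter=200, step=1.0):
    """Proximal gradient descent on poisson_nll(Z, Y) + lam * ||Z||_*.

    Assumption (not from the abstract): log-link X = exp(Z), so the
    returned estimate has strictly positive entries by construction.
    """
    Z = np.log(np.maximum(Y, 0.5))  # crude initialization that avoids log(0)
    for _ in range(n_iter):
        grad = np.exp(Z) - Y  # gradient of the smooth term
        f = poisson_nll(Z, Y)
        t = step
        while True:
            Z_new = svt(Z - t * grad, t * lam)
            D = Z_new - Z
            # Backtracking: accept once the quadratic upper bound holds.
            bound = f + np.sum(grad * D) + np.sum(D ** 2) / (2.0 * t)
            if poisson_nll(Z_new, Y) <= bound + 1e-12 or t < 1e-12:
                break
            t *= 0.5
        Z = Z_new
    return np.exp(Z)


# Usage on simulated counts with a rank-one intensity matrix.
rng = np.random.default_rng(0)
X_true = np.outer(rng.uniform(1.0, 5.0, size=20), rng.uniform(1.0, 3.0, size=15))
Y = rng.poisson(X_true).astype(float)
X_hat = fit_poisson_nuclear(Y, lam=2.0)
```

Larger values of `lam` shrink more singular values of `Z` to zero. In the multinomial case, a row-wise normalization of `exp(Z)` would give a row-stochastic estimate, which this sketch does not cover.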