Estimating the rank of a corrupted data matrix is an important task in data science, most notably for choosing the number of components in principal component analysis. Significant progress on this task has been made using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, such as Poisson or binomial, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, focusing on a Poisson random matrix with independent entries, we propose a simple procedure termed \textit{biwhitening} that makes it possible to estimate the rank of the underlying data matrix (i.e., the Poisson parameter matrix) without any prior knowledge on its structure. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson distribution, we extend our biwhitening approach to other discrete distributions, such as the generalized Poisson, binomial, multinomial, and negative binomial. We conduct numerical experiments that corroborate our theoretical findings, and demonstrate our approach on real single-cell RNA sequencing (scRNA-seq) data, where we show that our results agree with a slightly overdispersed generalized Poisson model.
翻译:估算腐败数据矩阵的等级是数据科学中的一项重要任务,其中最突出的是选择主要组成部分分析中的组件数量。任务上的重大进展是使用随机矩阵理论,将大型噪音矩阵的光谱属性定性为随机矩阵理论。然而,当数据矩阵包含随机变量时,如Poisson或binoomial, 使用这些工具并不简单, 因为在这些数据矩阵中, 噪音可以同时变异, 在每个条目中出现未知的差异。 在这项工作中, 侧重于包含独立条目的 Poisson 随机矩阵, 我们提议了一个名为\ textit{bewletning} 的简单程序, 从而可以在不事先了解其结构的情况下估算基本数据矩阵的级别( 即 Poisson 参数矩阵 矩阵 ) 。 我们的方法基于这样的关键观察, 即一个人可以同时缩放数据矩阵的行和列, 相应的噪音频谱与标准 Prenti- Pastur ( MP) 法律相匹配, 证明使用 MP 的顶端点作为排名的门槛值 。 。 Kencialalalalalalal 方法要求缩化因素可以直接通过Sink 进行我们Skinalalal 的排序分配。