Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users, with user $i$ contributing data samples from a $d$-dimensional distribution with mean $\mu_i$. Our goal is to recover the linear subspace shared by $\mu_1,\ldots,\mu_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $\mu_i$. If we have only one data point from each user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently computable estimator under non-spherical and user-dependent noise. We prove an upper bound on the estimation error of our estimator in general scenarios where the number of data points and the amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.
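As a minimal illustration of why two data points per user help (our own sketch of one natural construction, not necessarily the paper's exact estimator), write the two samples from user $i$ as $x_i^{(1)} = \mu_i + z_i^{(1)}$ and $x_i^{(2)} = \mu_i + z_i^{(2)}$ with independent mean-zero noise vectors $z_i^{(1)}, z_i^{(2)}$. Then
\[
\mathbb{E}\big[x_i^{(1)} x_i^{(2)\top}\big]
= \mathbb{E}\big[(\mu_i + z_i^{(1)})(\mu_i + z_i^{(2)})^\top\big]
= \mu_i \mu_i^\top ,
\]
since the cross terms vanish and $\mathbb{E}[z_i^{(1)} z_i^{(2)\top}] = \mathbb{E}[z_i^{(1)}]\,\mathbb{E}[z_i^{(2)}]^\top = 0$; the noise covariances never enter, spherical or not. Symmetrizing and averaging these cross products over users gives a matrix whose expectation is $\frac{1}{n}\sum_i \mu_i \mu_i^\top$, so its top eigenvectors estimate the shared subspace without any bias from non-spherical, user-dependent noise. With a single sample per user, the analogous matrix $x_i x_i^\top$ has expectation $\mu_i \mu_i^\top + \mathrm{Cov}(z_i)$, which is why unknown non-spherical noise makes one-sample recovery impossible without further assumptions.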