Learning from data in the presence of outliers is a fundamental problem in statistics. Until recently, no computationally efficient algorithms were known for estimating the mean of a high-dimensional distribution under natural assumptions in the presence of even a small fraction of outliers. In this paper, we consider robust statistics in the presence of overwhelming outliers, where the majority of the dataset is introduced adversarially. With only an $\alpha < 1/2$ fraction of "inliers" (clean data), the mean of a distribution is unidentifiable. However, in their influential work, [CSV17] introduced a polynomial-time algorithm that recovers the mean of distributions with bounded covariance by outputting a succinct list of $O(1/\alpha)$ candidate solutions, one of which is guaranteed to be close to the true distributional mean; this is a direct analog of "list decoding" in the theory of error-correcting codes. In this work, we develop an algorithm for list-decodable mean estimation in the same setting that achieves, up to constants, the information-theoretically optimal recovery and optimal sample complexity, and that runs in nearly linear time up to polylogarithmic factors in the dimension. Our conceptual innovation is to design a descent-style algorithm on a nonconvex landscape, iteratively removing minima to generate a succinct list of solutions. Our runtime bottleneck is a saddle-point optimization, for which we design custom primal-dual solvers for generalized packing and covering SDPs under Ky Fan norms, which may be of independent interest.