In many machine learning tasks, a common approach for dealing with large-scale data is to build a small summary, {\em e.g.,} a coreset, that can efficiently represent the original input. However, real-world datasets usually contain outliers, and most existing coreset construction methods are not resilient to them (in particular, an adversarial attacker can place an outlier arbitrarily in the space). In this paper, we propose a novel robust coreset method for {\em continuous-and-bounded learning} problems (with outliers), which include a broad range of popular optimization objectives in machine learning, {\em e.g.,} logistic regression and $k$-means clustering. Moreover, our robust coreset can be efficiently maintained in a fully-dynamic environment. To the best of our knowledge, this is the first robust and fully-dynamic coreset construction method for these optimization problems. Another highlight is that our coreset size can depend on the doubling dimension of the parameter space, rather than the VC dimension of the objective function, which could be very large or even challenging to compute. Finally, we conduct experiments on real-world datasets to evaluate the effectiveness of our proposed robust coreset method.