In the era of big data, many machine learning tasks must handle large-scale data. A common approach is to build a small summary, {\em e.g.,} a coreset, that can efficiently represent the original input. However, real-world datasets usually contain outliers, and most existing coreset construction methods are not resilient against them (in particular, an adversarial attacker can place the outliers arbitrarily in the space). In this paper, we propose a novel robust coreset method for the {\em continuous-and-bounded learning} problem (with outliers), which includes a broad range of popular optimization objectives in machine learning, such as logistic regression and $k$-means clustering. Moreover, our robust coreset can be efficiently maintained in a fully-dynamic environment. To the best of our knowledge, this is the first robust and fully-dynamic coreset construction method for these optimization problems. We also conduct experiments to evaluate the effectiveness of our robust coreset in practice.