Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear-sized coresets exist for logistic regression. To deal with intractable worst-case instances, we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. The measure $\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing it to uniform sampling as well as to state-of-the-art methods in the area. The experiments are conducted on real-world benchmark data for logistic regression.
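To make the comparison between uniform sampling and score-based sampling concrete, the following is a minimal sketch of a weighted subsample for the logistic loss. It is not the sampling scheme of this work, whose details the abstract does not give. The labels in $\{-1,+1\}$, the function names, and the score function (row norms of the $Q$ factor of a thin QR decomposition plus $1/n$) are all assumptions chosen for illustration.

```python
import numpy as np

def logistic_loss(X, y, beta, weights=None):
    """Weighted sum of ln(1 + exp(-y_i * <x_i, beta>)); labels assumed in {-1, +1}."""
    margins = -y * (X @ beta)
    losses = np.logaddexp(0.0, margins)  # numerically stable ln(1 + exp(margins))
    if weights is None:
        return losses.sum()
    return (weights * losses).sum()

def importance_sample(X, y, scores, m, rng):
    """Draw m points i.i.d. with probability proportional to `scores`.

    Each sampled point gets weight 1 / (m * p_i), so the weighted loss on the
    sample is an unbiased estimate of the loss on the full data set.
    """
    p = scores / scores.sum()
    idx = rng.choice(len(p), size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])
    return X[idx], y[idx], w

def illustrative_scores(X):
    """Hypothetical importance scores (an assumption, not taken from the abstract):
    row norms of Q from a thin QR decomposition of X, plus a uniform term 1/n."""
    Q, _ = np.linalg.qr(X)  # reduced mode: Q has the same shape as X
    return np.linalg.norm(Q, axis=1) + 1.0 / X.shape[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 5000, 5, 500
    X = rng.normal(size=(n, d))
    y = np.where(rng.random(n) < 0.5, -1.0, 1.0)
    beta = rng.normal(size=d)

    full = logistic_loss(X, y, beta)
    # Uniform sampling is the special case of constant scores.
    Xu, yu, wu = importance_sample(X, y, np.ones(n), m, rng)
    Xs, ys, ws = importance_sample(X, y, illustrative_scores(X), m, rng)
    print("full loss:        ", full)
    print("uniform estimate: ", logistic_loss(Xu, yu, beta, wu))
    print("weighted estimate:", logistic_loss(Xs, ys, beta, ws))
```

A $(1\pm\varepsilon)$-coreset would require the weighted estimate to stay within a factor $(1\pm\varepsilon)$ of the full loss for every $\beta$ simultaneously; the sketch only evaluates a single $\beta$ and makes no such guarantee.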