The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample. Such an approach has been used pervasively in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome the limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation in classical linear regression. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an $O(\mbox{nnz}(\mathbf{X})+rp^2)$ computational cost, where $\mathbf{X}$ is an $n\times p$ predictor matrix, $r$ is the number of elements selected from each column of $\mathbf{X}$, and $\mbox{nnz}(\cdot)$ denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and open-source datasets demonstrate the proposed method's superior performance compared to mainstream competitors.
翻译:核心集方法,也称为子采样或子集选择,旨在选择一个子样本作为观察样本的代理。这种方法已广泛用于大规模数据分析。现有的核心集方法使用来自预测矩阵中的行集的子集构建子样本。当预测矩阵稀疏或数值稀疏时,这种方法可能会显着低效。为了克服这个限制,我们开发了一种用于传统线性回归中大规模最小二乘估计的新型基于元素的子集选择方法,称为核心元素。我们提供一种确定性算法来构造核心元素估计器,仅需要一个$O(\mbox{nnz}(\mathbf{X})+rp^2)$的计算成本,其中$\mathbf{X}$是一个$n\times p$预测矩阵,$r$是从每个$\mathbf{X}$列中选择的元素数量,$\mbox{nnz}(\cdot)$表示非零元素的数量。从理论上讲,我们证明了所提出的估计器是无偏的,并且近似最小化了估计方差的上界。通过为所提出的估算器推导类似核心集的有限样本界限,我们还提供了一个近似保证。为了处理数据中可能存在的异常值,我们进一步将核心元素与均值的中位数过程相结合,得到一个具有理论一致性保证的高效和稳健的估计器。在各种合成和开源数据集上进行的数值研究证明了所提出方法与主流竞争对手相比具有更好的性能。