The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample. Such an approach has been used pervasively in large-scale data analysis. Existing coresets methods construct the subsample from a subset of rows of the predictor matrix. Such methods can be highly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements. We provide a deterministic algorithm to construct the core-elements estimator that requires only $O(\mbox{nnz}(\mathbf{X})+rp^2)$ computational cost, where $\mathbf{X}$ is an $n\times p$ predictor matrix, $r$ is the number of elements selected from each column of $\mathbf{X}$, and $\mbox{nnz}(\cdot)$ denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors.
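To make the element-wise selection concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a selection rule that the abstract does not spell out: in each column of $\mathbf{X}$, keep the $r$ entries of largest absolute value and set the rest to zero, giving a sparsified matrix $\mathbf{X}^*$. It then solves $(\mathbf{X}^{*\top}\mathbf{X})\tilde{\beta} = \mathbf{X}^{*\top}\mathbf{y}$. Under a linear model $\mathbf{y} = \mathbf{X}\beta + \varepsilon$ with fixed $\mathbf{X}$ and zero-mean noise, an estimator of this form satisfies $E[\tilde{\beta}] = \beta$, which matches the unbiasedness claim. The function name `core_elements_estimate` and the symbol $\mathbf{X}^*$ are hypothetical, introduced only for illustration.

```python
import numpy as np


def core_elements_estimate(X, y, r):
    """Sketch of an element-wise subset-selection least squares estimator.

    Assumption (not stated in the abstract): for every column of X we keep
    the r entries with the largest absolute value and zero out the rest.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")

    # Row indices of the r largest-magnitude entries in each column.
    # argpartition places the r smallest values of -|X| in the first r rows.
    idx = np.argpartition(-np.abs(X), r - 1, axis=0)[:r, :]  # shape (r, p)
    cols = np.arange(p)

    # Sparsified copy of X holding only the selected elements.
    X_star = np.zeros_like(X)
    X_star[idx, cols] = X[idx, cols]

    # Solve (X*^T X) beta = X*^T y.
    # Note: this dense version costs O(n p^2); reaching the cost quoted in
    # the abstract would require storing X* in a sparse format.
    A = X_star.T @ X
    b = X_star.T @ y
    return np.linalg.solve(A, b)
```

With `r = n` every element is kept, so the sketch reduces to ordinary least squares. With noiseless responses it recovers the coefficients exactly for any `r` for which the linear system is non-singular.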