Algorithmic Gaussianization is a phenomenon that can arise when using randomized sketching or sampling methods to produce smaller representations of large datasets: For certain tasks, these sketched representations have been observed to exhibit many robust performance characteristics that are known to occur when a data sample comes from a sub-gaussian random design, which is a powerful statistical model of data distributions. However, this phenomenon has only been studied for specific tasks and metrics, or by relying on computationally expensive methods. We address this by providing an algorithmic framework for gaussianizing data distributions via averaging, proving that it is possible to efficiently construct data sketches that are nearly indistinguishable (in terms of total variation distance) from sub-gaussian random designs. In particular, relying on a recently introduced sketching technique called Leverage Score Sparsified (LESS) embeddings, we show that one can construct an $n\times d$ sketch of an $N\times d$ matrix $A$, where $n\ll N$, that is nearly indistinguishable from a sub-gaussian design, in time $O(\text{nnz}(A)\log N + nd^2)$, where $\text{nnz}(A)$ is the number of non-zero entries in $A$. As a consequence, strong statistical guarantees and precise asymptotics available for the estimators produced from sub-gaussian designs (e.g., for least squares and Lasso regression, covariance estimation, low-rank approximation, etc.) can be straightforwardly adapted to our sketching framework. We illustrate this with a new approximation guarantee for sketched least squares, among other examples.
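To make the construction concrete, below is a minimal, illustrative sketch (not the paper's exact algorithm) of a LESS-style leverage-score-sparsified embedding applied to least squares. For clarity it computes exact leverage scores via a thin QR factorization, which costs $O(Nd^2)$; the quoted $O(\text{nnz}(A)\log N + nd^2)$ runtime instead relies on fast approximate leverage scores. The function names (`leverage_scores`, `less_sketch`) and the sparsity parameter `s` are our own naming for this example.

```python
# Minimal illustrative sketch of a LESS-style (Leverage Score Sparsified)
# embedding for least squares. Exact leverage scores are used for clarity;
# the paper's construction uses fast approximate leverage scores.
import numpy as np

def leverage_scores(A):
    """Exact leverage scores of the rows of A (O(N d^2), for illustration)."""
    Q, _ = np.linalg.qr(A)          # thin QR: A = Q R, Q is N x d
    return np.sum(Q**2, axis=1)     # l_i = ||q_i||^2, with sum_i l_i = d

def less_sketch(A, b, n, s, rng):
    """Return (S A, S b) for a sparse n x N sketching matrix S whose rows
    each have s nonzeros placed i.i.d. proportionally to leverage scores."""
    N, d = A.shape
    p = leverage_scores(A)
    p = p / p.sum()                 # sampling distribution over rows of A
    SA = np.zeros((n, d))
    Sb = np.zeros(n)
    for j in range(n):
        idx = rng.choice(N, size=s, p=p)            # nonzero positions
        signs = rng.choice([-1.0, 1.0], size=s)     # Rademacher signs
        w = signs / np.sqrt(n * s * p[idx])         # scaled so E[S^T S] = I
        SA[j] = w @ A[idx]
        Sb[j] = w @ b[idx]
    return SA, Sb

# Usage: solve a sketched least-squares problem and compare to the full solve.
rng = np.random.default_rng(0)
N, d = 20000, 20
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)
SA, Sb = less_sketch(A, b, n=400, s=4 * d, rng=rng)
x_sketch, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full))
```

The per-row scaling $1/\sqrt{n\,s\,p_i}$ is chosen so that $\mathbb{E}[S^\top S] = I$, mirroring the normalization of a dense sub-gaussian sketch while keeping only $s$ nonzeros per row; the sparsity level here ($s = 4d$) is an assumption made for the example, not a prescription from the paper.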