In streaming Singular Value Decomposition (SVD), the $d$-dimensional rows of a possibly infinite matrix arrive sequentially as points in $\mathbb{R}^d$. An $\epsilon$-coreset is a (much smaller) matrix whose sum of squared distances from its rows to any hyperplane approximates that of the original matrix to within a factor of $1 \pm \epsilon$. Our main result is that we can maintain an $\epsilon$-coreset while storing only $O(d \log^2 d / \epsilon^2)$ rows. Known lower bounds of $\Omega(d / \epsilon^2)$ rows show that this is nearly optimal. Moreover, our coreset is a weighted subset of the input rows. This is highly desirable since it: (1) preserves sparsity; (2) is easily interpretable; (3) avoids precision errors; (4) applies to problems with constraints on the input. Previous streaming results for SVD that return a subset of the input required storing $\Omega(d \log^3 n / \epsilon^2)$ rows, where $n$ is the number of rows seen so far. Our algorithm, whose storage is independent of $n$, is the first result that uses finite memory on infinite streams. We support our findings with experiments on the Wikipedia dataset, benchmarked against state-of-the-art algorithms.
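To make the $\epsilon$-coreset guarantee concrete, the following minimal sketch checks the $1 \pm \epsilon$ condition for one query hyperplane. It is an illustration of the definition only, not the streaming construction described above. The function names `hyperplane_cost` and `is_within_eps` are hypothetical, and the sketch assumes that the hyperplane passes through the origin and is given by a normal vector.

```python
import math


def hyperplane_cost(rows, weights, normal):
    """Weighted sum of squared distances from rows to the hyperplane
    {x : <x, normal> = 0}. Assumption: the hyperplane passes through the origin."""
    norm = math.sqrt(sum(c * c for c in normal))
    unit = [c / norm for c in normal]
    total = 0.0
    for row, w in zip(rows, weights):
        # The distance from a point to the hyperplane is its projection
        # onto the unit normal.
        dist = sum(x * u for x, u in zip(row, unit))
        total += w * dist * dist
    return total


def is_within_eps(full_rows, core_rows, core_weights, normal, eps):
    """Check (1 - eps) * cost(full) <= cost(coreset) <= (1 + eps) * cost(full)
    for a single query hyperplane (hypothetical helper)."""
    full = hyperplane_cost(full_rows, [1.0] * len(full_rows), normal)
    core = hyperplane_cost(core_rows, core_weights, normal)
    return (1 - eps) * full <= core <= (1 + eps) * full


# Toy input: every distinct row appears three times, so keeping each
# distinct row once with weight 3 is a weighted subset of the input rows
# that preserves the cost for every hyperplane.
base = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
full_rows = base * 3
core_rows = base
core_weights = [3.0, 3.0]
query_normal = [1.0, 1.0, 0.0]

print(is_within_eps(full_rows, core_rows, core_weights, query_normal, eps=0.01))
```

A true $\epsilon$-coreset must satisfy this condition for every hyperplane simultaneously; the sketch tests a single query, which is enough to rule out a bad candidate but not to certify a good one.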