Streaming Singular Value Decomposition for Big Data Applications

Singular Value Decomposition (SVD) plays a pivotal role in exploratory data analysis. However, in a Big Data setting, computing the dominant singular vectors is often limited by the main memory that the dataset requires. Recently introduced randomized projection schemes attempt to mitigate this memory load by constructing approximate projections of the true dataset in a streaming setting. However, these projection methods come at the cost of approximation errors in both the top singular values and the top singular vectors. Furthermore, in order to bound the approximation error, an over-sampled projection is required, often much larger in dimension than the desired rank. This over-sampling can still be memory intensive when the data dimension is large, and it is unnecessary when the desired rank is close to full rank. We present a two-stage neural optimization approach as an alternative to conventional and randomized SVD techniques, where the memory requirement depends explicitly on the feature dimension and the desired rank, independent of the sample size. The proposed scheme reads data samples in a streaming setting, and the network minimization problem converges to a low-rank approximation with high precision. Our architecture is fully interpretable: every network output and weight has a specific meaning. We evaluate our results on various performance metrics against state-of-the-art streaming methods. We also present numerical experiments for singular value and eigenvalue decomposition on real data at various scales to show the memory efficiency of our proposed approach.
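To make the over-sampling trade-off concrete, the following is a minimal sketch of one common randomized projection baseline, not the neural approach proposed here and not necessarily one of the baselines we compare against. It assumes NumPy, data that can be streamed twice in row batches, and a feature dimension of at least the sketch size; the function name `streaming_randomized_svd`, the `batches_fn` callback, and the over-sampling parameter `p` are hypothetical names chosen for illustration. Working memory is on the order of `d * (k + p)`, so it grows with the over-sampling `p` but not with the number of samples.

```python
import numpy as np

def streaming_randomized_svd(batches_fn, d, k, p=10, seed=0):
    """Two-pass randomized sketch of the top-k singular values and
    right singular vectors of a row-streamed matrix A (n x d).

    batches_fn: hypothetical callback returning a fresh iterator over
                row batches of shape (b, d) each time it is called.
    k: desired rank; p: over-sampling (assumed parameter name).
    """
    rng = np.random.default_rng(seed)
    l = min(k + p, d)  # sketch size, capped at the feature dimension

    # Pass 1: accumulate S = Psi @ A without ever storing A or Psi whole.
    S = np.zeros((l, d))
    for X in batches_fn():
        Psi = rng.standard_normal((l, X.shape[0]))
        S += Psi @ X

    # Orthonormal basis Q (d x l) for the approximate row space of A.
    Q, _ = np.linalg.qr(S.T)

    # Pass 2: accumulate the small Gram matrix G = Q^T A^T A Q (l x l).
    G = np.zeros((Q.shape[1], Q.shape[1]))
    for X in batches_fn():
        Y = X @ Q
        G += Y.T @ Y

    # Eigenpairs of G give squared singular values and rotated vectors.
    evals, W = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1][:k]
    sigma = np.sqrt(np.clip(evals[order], 0.0, None))
    V = Q @ W[:, order]  # approximate right singular vectors (d x k)
    return sigma, V
```

A caller would pass something like `lambda: (A[i:i+64] for i in range(0, len(A), 64))` as `batches_fn`. When the data has exact rank at most `k + p`, the sketch recovers the singular values up to floating-point error; otherwise the error depends on the spectral decay and on `p`, which is the over-sampling cost described above.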
