Dimensionality reduction (DR) is a critical step in scaling machine learning pipelines. Principal component analysis (PCA) is a standard tool for DR, but performing PCA over a full dataset can be prohibitively expensive. As a result, theoretical work has studied the effectiveness of iterative, stochastic PCA methods that operate over data samples. However, existing termination conditions for stochastic PCA either run for a predetermined number of iterations or until the solution converges, frequently sampling too many or too few datapoints for end-to-end runtime improvements. We show how accounting for downstream analytics operations during DR via PCA allows stochastic methods to terminate efficiently after operating over small (e.g., 1%) subsamples of input data, reducing whole-workload runtime. Leveraging this, we propose DROP, a DR optimizer that enables speedups of up to 5x over singular-value-decomposition-based PCA techniques, and exceeds conventional approaches such as FFT and PAA by up to 16x in end-to-end workloads.
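The core idea, terminating a sample-based PCA as soon as the low-dimensional representation is good enough for the rest of the workload, can be sketched as follows. This is a minimal illustration under stated assumptions, not DROP's implementation: the function name `progressive_pca`, the `sample_frac` and `quality` parameters, and the explained-variance stopping proxy are all hypothetical stand-ins for DROP's actual cost-based termination rule, which weighs marginal DR quality against downstream runtime.

```python
import numpy as np

def progressive_pca(X, k, sample_frac=0.01, quality=0.98, seed=0):
    """Fit PCA on a growing row subsample, stopping as soon as the top-k
    components capture enough of the subsample's variance. Sketch only:
    DROP's real stopping rule also accounts for downstream analytics
    runtime, which is omitted here."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    batch = max(1, int(sample_frac * n))
    idx = rng.choice(n, size=batch, replace=False)
    while True:
        S = X[idx] - X[idx].mean(axis=0)   # center the current subsample
        # PCA via SVD on the subsample
        _, s, Vt = np.linalg.svd(S, full_matrices=False)
        explained = (s[:k] ** 2).sum() / (s ** 2).sum()
        if explained >= quality or len(idx) >= n:
            return Vt[:k]                  # top-k principal directions
        # Not good enough yet: draw another batch of unseen rows and refit.
        remaining = np.setdiff1d(np.arange(n), idx)
        extra = rng.choice(remaining, size=min(batch, remaining.size),
                           replace=False)
        idx = np.concatenate([idx, extra])

# Usage: learn the subspace from a small subsample, then project all rows.
# X = np.random.randn(100_000, 128)
# Vk = progressive_pca(X, k=16)
# X_low = X @ Vk.T
```

Refitting via a full SVD on each enlarged subsample keeps the sketch short; an incremental or power-iteration update would avoid redundant work in practice.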