以k- means 群集方式有效且粗雅的计票- sketch (Effective and Sparse Count-Sketch via k-means clustering)

Count-sketch is a popular matrix sketching algorithm that can produce a sketch of an input data matrix X in O(nnz(X))time where nnz(X) denotes the number of non-zero entries in X. The sketched matrix will be much smaller than X while preserving most of its properties. Therefore, count-sketch is widely used for addressing high-dimensionality challenge in machine learning. However, there are two main limitations of count-sketch: (1) The sketching matrix used count-sketch is generated randomly which does not consider any intrinsic data properties of X. This data-oblivious matrix sketching method could produce a bad sketched matrix which will result in low accuracy for subsequent machine learning tasks (e.g.classification); (2) For highly sparse input data, count-sketch could produce a dense sketched data matrix. This dense sketch matrix could make the subsequent machine learning tasks more computationally expensive than on the original sparse data X. To address these two limitations, we first show an interesting connection between count-sketch and k-means clustering by analyzing the reconstruction error of the count-sketch method. Based on our analysis, we propose to reduce the reconstruction error of count-sketch by using k-means clustering algorithm to obtain the low-dimensional sketched matrix. In addition, we propose to solve k-mean clustering using gradient descent with -L1 ball projection to produce a sparse sketched matrix. Our experimental results based on six real-life classification datasets have demonstrated that our proposed method achieves higher accuracy than the original count-sketch and other popular matrix sketching algorithms. Our results also demonstrate that our method produces a sparser sketched data matrix than other methods and therefore the prediction cost of our method will be smaller than other matrix sketching methods.

翻译：计票是一种流行的矩阵草图算法,它可以产生O(nnz(X))时间输入数据矩阵X的草图,而nnz(X) 时间中输入数据矩阵的草图可以显示X的非零条目数量。草图矩阵在保存其大部分属性的同时会比X要小得多。因此,计数矩阵被广泛用来应对机器学习中的高度挑战。然而,计票背景的两个主要局限性是:(1) 草图矩阵使用的点数矩阵是随机生成的,它不考虑X的任何内在数据性质。这种数据模糊的矩阵绘制方法可能会产生一个错误的草图矩阵,从而导致随后的机器学习任务(例如,分类); 草图矩阵矩阵将大大低于X。这种粗略的矩阵方法将减少我们最初的底线矩阵错误,因此,对于高度粗略的输入数据数据矩阵数据,我们提议以更低的基数计算方法来计算。