Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation

We present a new approach for computing compact sketches that can be used to approximate the \emph{inner product} between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on those of popular \emph{linear sketching} approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Specifically, while our method admits guarantees that exactly match those of linear sketching for dense vectors, it yields significantly \emph{lower} error for sparse vectors with limited overlap between non-zero entries. Such vectors arise in many applications involving sparse data. They are also important in increasingly popular dataset search applications, where inner product sketches are used to estimate data covariance, conditional means, and other quantities involving columns in \emph{unjoined tables}. We complement our theoretical results by showing that our approach empirically outperforms existing linear sketches and unweighted hashing-based sketches for sparse vectors.
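For readers unfamiliar with the linear sketching baseline named above, the following is a minimal illustrative sketch of CountSketch-style inner product estimation. It is not the Weighted MinHash method of this paper, and it is not taken from the authors' code. The function names `count_sketch` and `estimate_inner_product`, the sparse `dict` representation, and the use of Python's seeded `random.Random` as a stand-in for a proper hash family are all assumptions made for illustration.

```python
import random


def count_sketch(vec, m, seed):
    """Compress a sparse vector (dict: index -> value) into m counters.

    Assumption: a seeded PRNG per index stands in for the bucket hash and
    the sign hash; a real implementation would use a proper hash family.
    """
    sketch = [0.0] * m
    for idx, val in vec.items():
        # Derive a deterministic pseudo-random stream for this (seed, index).
        rng = random.Random(seed * 1_000_003 + idx)
        bucket = rng.randrange(m)                    # which counter the entry lands in
        sign = 1.0 if rng.random() < 0.5 else -1.0   # random sign for the entry
        sketch[bucket] += sign * val
    return sketch


def estimate_inner_product(sketch_a, sketch_b):
    """Estimate <a, b> as the inner product of two sketches.

    Both sketches must have been built with the same m and the same seed.
    """
    return sum(x * y for x, y in zip(sketch_a, sketch_b))


# Usage: two sparse vectors whose supports overlap only on indices 1 and 2.
a = {0: 1.0, 1: 2.0, 2: 3.0}
b = {1: 1.0, 2: 1.0, 3: 5.0}
sa = count_sketch(a, m=64, seed=7)
sb = count_sketch(b, m=64, seed=7)
print(estimate_inner_product(sa, sb))  # noisy estimate of the true value 5.0
```

Each vector is sketched independently, so the two columns never need to be joined. The noise in the estimate comes from entries of `a` and `b` with different indices colliding in the same counter, which can happen even when the supports barely overlap. That is the regime in which the abstract claims an advantage for the hashing-based approach.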
