Subsets and Supermajorities: Optimal Hashing-based Set Similarity Search

We formulate and optimally solve a new generalized Set Similarity Search problem, which assumes that the sizes of the database and query sets are known in advance. By creating polylog copies of our data structure, we optimally solve any symmetric Approximate Set Similarity Search problem, including approximate versions of Subset Search, Maximum Inner Product Search (MIPS), Jaccard Similarity Search and Partial Match. Our algorithm can be seen as a natural generalization of previous work on Set as well as Euclidean Similarity Search, but conceptually it differs by optimally exploiting the information present in the sets as well as their complements, and doing so asymmetrically between queries and stored sets. In doing so, we improve upon the best previous work: MinHash [J. Discrete Algorithms 1998], SimHash [STOC 2002], Spherical LSF [SODA 2016, 2017] and Chosen Path [STOC 2017] by as much as a factor $n^{0.14}$ in both time and space; or, in the near-constant time regime, in space, by an arbitrarily large polynomial factor. Turning the geometric concept, based on Boolean supermajority functions, into a practical algorithm requires ideas from branching random walks on $\mathbb Z^2$, for which we give the first non-asymptotic near-tight analysis. Our lower bounds follow from new hypercontractive arguments, which can be seen as characterizing the exact family of similarity search problems for which supermajorities are optimal. The optimality holds among all hashing-based data structures in the random setting and, by reductions, for 1-cell and 2-cell probe data structures. As a side effect, we obtain new hypercontractive bounds on the directed noise operator $T^{p_1 \to p_2}_\rho$.
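For readers unfamiliar with the baselines named above, the following is a minimal sketch of MinHash estimating Jaccard similarity. It illustrates the prior technique only, not the supermajority data structure of this paper. All function names are hypothetical, set elements are assumed to be small non-negative integers, and a salted BLAKE2b hash stands in for a random hash function.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| (defined as 1.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(x, i):
    """Hash integer x under the i-th hash function (assumption: salted BLAKE2b)."""
    digest = hashlib.blake2b(
        x.to_bytes(8, "little"),
        digest_size=8,
        salt=i.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, k):
    """For each of k hash functions, keep the minimum hash value over the set."""
    return [min(_hash(x, i) for x in items) for i in range(k)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; its expectation is the Jaccard similarity."""
    matches = sum(u == v for u, v in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    A = set(range(0, 60))
    B = set(range(30, 90))
    k = 400  # number of hash functions (illustrative choice)
    sig_a = minhash_signature(A, k)
    sig_b = minhash_signature(B, k)
    print("exact    :", jaccard(A, B))
    print("estimated:", estimate_jaccard(sig_a, sig_b))
```

MinHash treats queries and stored sets identically and looks only at the elements present in each set. The abstract contrasts this with an approach that also uses the complements of the sets and treats queries and stored sets asymmetrically.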
