We study the top-k set similarity search problem using semantic overlap. While vanilla overlap requires exact matches between set elements, semantic overlap allows elements that are syntactically different but semantically related to increase the overlap. The semantic overlap is the maximum matching score of a bipartite graph, where an edge weight between two set elements is defined by a user-defined similarity function, e.g., cosine similarity between embeddings. Common techniques like token indexes fail for semantic search since similar elements may be unrelated at the character level. Further, verifying candidates is expensive (cubic versus linear for syntactic overlap), calling for highly selective filters. We propose KOIOS, the first exact and efficient algorithm for semantic overlap search. KOIOS leverages sophisticated filters to minimize the number of required graph-matching calculations. Our experiments show that for medium to large sets less than 5% of the candidate sets need verification, and more than half of those sets are further pruned without requiring the expensive graph matching. We show the efficiency of our algorithm on four real datasets and demonstrate the improved result quality of semantic over vanilla set similarity search.
翻译:KOIOS: 基于语义重叠的 top-k 集合相似度搜索问题的研究。虽然传统的重叠要求集合元素的完全匹配,但是语义重叠允许语法上不同但语义相关的元素增加重叠。语义重叠是一个二分图的最大匹配分数,其中两个集合元素之间的边权值由用户定义的相似度函数(例如嵌入之间的余弦相似度)定义。传统的技术如标记索引在语义搜索上失灵,因为相似的元素可以在字符级别上毫不相关。此外,候选集的验证非常昂贵(对于语法上的重叠,为立方级别,而对于语义重叠,为线性级别),需要高度选择性的过滤器。我们提出了 KOIOS,这是一种基于精心设计的过滤器的确切且高效的算法,以最小化所需的图匹配计算数量。我们的实验证明,对于中大型集合,不到 5% 的候选集需要验证,其中超过一半的集合在不需要进行昂贵的图匹配的情况下被进一步修剪。我们展示了我们的算法在四个真实数据集上的效率,并演示了语义优于传统的集合相似度搜索的结果质量提高。