We consider a similarity measure between two sets $A$ and $B$ of vectors, that balances the average and maximum cosine distance between pairs of vectors, one from set $A$ and one from set $B$. As a motivation for this measure, we present lineage tracking in a database. To practically realize this measure, we need an approximate search algorithm that given a set of vectors $A$ and sets of vectors $B_1,...,B_n$, the algorithm quickly locates the set $B_i$ that maximizes the similarity measure. For the case where all sets are singleton sets, essentially each is a single vector, there are known efficient approximate search algorithms, e.g., approximated versions of tree search algorithms, locality-sensitive hashing (LSH), vector quantization (VQ) and proximity graph algorithms. In this work, we present approximate search algorithms for the general case. The underlying idea in these algorithms is encoding a set of vectors via a "long" single vector. The proposed approximate approach achieves significant performance gains over an optimized, exact search on vector sets.
翻译:我们认为两种矢量的相似度度是两套A美元和两套B美元之间的相似度度量,这种量度平衡了两套矢量的平均值和最大余弦距离,一对设定美元,一对设定美元,一对设定美元,一对设定美元。作为这一度量的动机,我们在数据库中提供线系跟踪。为了实际实现这一度量,我们需要一种近似搜索算法,根据一套矢量的矢量值和几套矢量的值,一美元、一美元和一美元,算法迅速定位了一组美元,使相似度量度量最大化。对于所有数据集都是单吨数的情况,基本上每套都是单一矢量,则有已知的有效近似搜索算法,例如树木搜索算法的近似版本、对地点敏感的散射法(LSH)、矢量定量(VQ)和近距离图算法。在这项工作中,我们提出了一般情况的近似搜索算法。这些算法的基本想法是通过“长期”单一矢量对一套矢量进行编码。拟议的近似方法在优化矢量、精确的矢量组合上取得了显著的绩效。