Maximum Inner Product Search or top-k retrieval on sparse vectors is well-understood in information retrieval, with a number of mature algorithms that solve it exactly. However, all existing algorithms are tailored to text and frequency-based similarity measures. To achieve optimal memory footprint and query latency, they rely on the near stationarity of documents and on laws governing natural languages. We consider, instead, a setup in which collections are streaming -- necessitating dynamic indexing -- and where indexing and retrieval must work with arbitrarily distributed real-valued vectors. As we show, existing algorithms are no longer competitive in this setup, even against naive solutions. We investigate this gap and present a novel approximate solution, called Sinnamon, that can efficiently retrieve the top-k results for sparse real valued vectors drawn from arbitrary distributions. Notably, Sinnamon offers levers to trade-off memory consumption, latency, and accuracy, making the algorithm suitable for constrained applications and systems. We give theoretical results on the error introduced by the approximate nature of the algorithm, and present an empirical evaluation of its performance on two hardware platforms and synthetic and real-valued datasets. We conclude by laying out concrete directions for future research on this general top-k retrieval problem over sparse vectors.
翻译:对稀有矢量的最大产品搜索或顶点检索在信息检索中被完全理解,信息检索中有一定数量的成熟算法可以完全解决这个问题。然而,所有现有的算法都是根据文本和基于频率的类似措施定制的。为了实现最佳的记忆足迹和查询延缓度,它们依赖于文件的接近静止性和自然语言管理法。相反,我们考虑的是集集的设置 -- -- 需要动态索引化 -- -- 以及索引和检索必须同任意分布的真实价值矢量起作用。正如我们所显示的那样,现有的算法在这个设置中不再具有竞争力,甚至与天真的解决方案相对。我们调查了这一差距,并提出了一个新的近似解决办法,称为Sinnanon,它能够有效地检索从任意分布中提取的稀有真正价值矢量的矢量的最大结果。特别是Sinnanon提供了交换记忆消耗、纬度和精确度的杠杆,使算法适合于受限制的应用程序和系统。我们从逻辑的大致性质中得出了错误的理论结果,并展示了它在两个硬件平台上的表现以及合成和不断变现的矢量的矢量数据,我们通过对未来进行一般性的具体研究而得出了这一结果。