One of the most common operations in multimodal scientific data management is searching the database for the $k$ items most similar to a given query item, i.e., its $k$-nearest neighbors (KNN). Although recent advances in multimodal machine learning models offer a \textit{semantic} index, the so-called \textit{embedding vectors} mapped from the original multimodal data, the dimensionality of the resulting embedding vectors is usually on the order of hundreds to a thousand, which is impractically high for time-sensitive scientific applications. This work proposes reducing the dimensionality of the output embedding vectors such that the set of top-$k$ nearest neighbors does not change in the lower-dimensional space, a property we call Order-Preserving Dimension Reduction (OPDR). To develop such an OPDR method, our central hypothesis is that, by analyzing the intrinsic relationships among the key parameters of the dimension-reduction map, a quantitative function can be constructed that reveals the correlation between the target (lower) dimensionality and the other variables. To demonstrate this hypothesis, the paper first defines a formal measure function that quantifies the KNN similarity for a specific vector, then extends this measure to an aggregate accuracy over the entire metric space, and finally derives a closed-form function relating the target (lower) dimensionality to the other variables. We incorporate the closed-form function into popular dimension-reduction methods, various distance metrics, and embedding models.
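The per-vector KNN similarity measure and its aggregation over the metric space can be illustrated with a minimal sketch. This is not the paper's implementation: the functions (`knn_indices`, `knn_overlap`, `pca_reduce`) and the choice of PCA with Euclidean distance are illustrative assumptions; the sketch simply scores, for each vector, the fraction of its $k$ nearest neighbors preserved after reduction, and averages that score over the dataset.

```python
# Sketch (hypothetical helper names, not the paper's code): measure how well
# a dimension reduction preserves each vector's k-nearest-neighbor set.
import numpy as np

def knn_indices(X, k):
    # Pairwise Euclidean distances; each point excludes itself.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def knn_overlap(X_high, X_low, k):
    # Aggregate accuracy: mean over all vectors of |kNN_high ∩ kNN_low| / k.
    hi, lo = knn_indices(X_high, k), knn_indices(X_low, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(hi, lo)]))

def pca_reduce(X, m):
    # Project centered data onto the top-m principal components (PCA via SVD).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))              # e.g. 512-d embedding vectors
acc = knn_overlap(X, pca_reduce(X, 64), k=10)
```

An OPDR-style closed-form function would predict, ahead of time, the smallest target dimensionality `m` for which this aggregate accuracy stays at a desired level, instead of sweeping `m` empirically as the sketch would require.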