Research on Hash-Based Approximate Nearest Neighbor Search for Massive High-Dimensional Data

Project title: Research on Hash-Based Approximate Nearest Neighbor Search for Massive High-Dimensional Data

Project number: No.61472298

Project type: General Program

Year of approval: 2015

Discipline: Automation technology, computer technology

Project author: Cui Jiangtao (崔江涛)

Affiliation: Xidian University (西安电子科技大学)

Funding: 800,000 CNY

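Chinese abstract: This project addresses the excessive storage and computational complexity caused by massive high-dimensional data, and studies hashing theory and its application to nearest neighbor search. Hashing is widely used as an effective means of compact data representation, but it still has shortcomings when handling approximate nearest neighbor queries: existing methods either trade a huge space overhead for time efficiency, or save storage space at the cost of query time. First, for nearest neighbor search, the project studies and proposes a new type of query problem that returns an approximate I-th nearest neighbor. Second, it analyzes the mechanism of hash mappings for high-dimensional data, proposes a hashing computation model for approximate nearest neighbor search, and builds a linear-order-based indexing and query framework for high-dimensional vector spaces, addressing the huge storage overhead of existing methods. Finally, for the Hamming space obtained after hash mapping, it studies indexing and query mechanisms for Hamming distance to achieve efficient nearest neighbor search in Hamming space, addressing the high query complexity of current methods. In essence, hashing offers two distinct approaches to approximate nearest neighbor search: collision-based filtering and compressed representation. This project combines the two; its distinguishing feature is achieving efficiency in both storage space and query time, meeting the storage and query needs of massive high-dimensional data in big data environments.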

Chinese keywords: massive data management; high-dimensional data; approximate nearest neighbor search; hashing methods

English abstract: To address the high storage and computational complexity brought about by large-scale high-dimensional data, this project explores hashing theory and its application to nearest neighbor search. Although hashing is widely used as an effective method for compact representation of high-dimensional data, it still has drawbacks when dealing with approximate nearest neighbor search: existing methods either obtain time efficiency at the cost of a huge amount of space, or save space by sacrificing time. In this project, we first propose a novel version of the approximate nearest neighbor problem, called the I-th approximate nearest neighbor. Then, based on an analysis of the mechanism of hash mappings for high-dimensional data, we propose a hashing computation model for approximate nearest neighbor search and build a framework for high-dimensional indexing and search based on linear order structures, in order to solve the huge storage overhead of existing methods. Finally, for the Hamming space obtained after hashing, we explore indexing and search mechanisms for Hamming distance to make nearest neighbor search in Hamming space efficient and to solve the high search complexity of existing methods. In essence, hashing offers two distinct approaches to approximate nearest neighbor search, collision-based filtering and compressed representation, and this project combines them. Its main feature is achieving efficiency in both storage space and search time simultaneously, thereby satisfying the storage and search requirements of large-scale high-dimensional data in big data environments.

English keywords: large-scale data management; high-dimensional data; approximate nearest neighbor search; hashing
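
Illustrative sketch (not part of the project record): the abstract describes mapping high-dimensional vectors to binary codes and then searching by Hamming distance. The Python below is a minimal, generic example of that idea using random-hyperplane (sign projection) hashing followed by a linear Hamming ranking. It is an assumption chosen for illustration, not the project's own hashing model or index; all function names (make_hyperplanes, hash_code, hamming, build_index, query) are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """Map a vector to an integer whose bits are the signs of its projections."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Hamming distance between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")


def build_index(data, planes):
    """Store only the compact codes, one per data vector."""
    return [hash_code(v, planes) for v in data]


def query(q, codes, planes, k=1):
    """Return indices of the k codes closest to q's code in Hamming distance.

    This is a brute-force scan over all codes; a real system would use a
    Hamming-space index to avoid touching every code.
    """
    qc = hash_code(q, planes)
    ranked = sorted(range(len(codes)), key=lambda i: hamming(qc, codes[i]))
    return ranked[:k]
```

A possible usage: build planes with `make_hyperplanes(dim, 32)`, call `build_index` once on the dataset, then call `query` per search vector and re-rank the returned candidates by exact distance if higher accuracy is needed. The brute-force scan is kept only for brevity; it shows the compressed-representation side of hashing rather than a sublinear index.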
