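Project title: Research on Data Caching Techniques for the Long-Tail Phenomenon
Project number: No.61502189
Project type: Young Scientists Fund
Year of approval: 2016
Discipline: Automation technology, computer technology
Project author: Wang Hua (王桦)
Affiliation: Huazhong University of Science and Technology
Funding: 200,000 yuan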
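Abstract (translated from Chinese): The access pattern of big data has shifted from the traditional Zipf distribution to the stretched exponential (SE) distribution, so traditional data caching techniques no longer suit big data access. The long tail of the SE distribution, and caching efficiency far below that under a Zipf distribution, both stem from weakened access locality and insufficient cache space. This project proposes a big data cache structure oriented to the long-tail phenomenon: by mining the cold-file data blocks contained in hot files, it raises the cache hit ratio for cold files while preserving the cache hit ratio for hot files. It proposes a file classification method based on attribute sets and similarity detection to implement distributed cache management, using locality-sensitive hashing to group files and narrow the search scope for duplicate data. It further uses a dynamic counting Bloom filter array to speed up duplicate detection, improving the cache's performance when searching the full set of cold data in the long tail. The project breaks from the fixed mindset of traditional caching research, which targets only hot data, and focuses on cold data under the SE distribution, whose scale and value keep growing, offering a new approach to big data cache design.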
Keywords (translated from Chinese): data caching; access pattern; big data; data deduplication; locality-sensitive hashing
Abstract (English): The access pattern of big data has changed from the traditional Zipf-like distribution to the Stretched Exponential (SE) distribution, and conventional caching approaches are no longer suitable for big data access. The root cause of the SE distribution's long tail, and of its lower caching efficiency compared with the Zipf-like distribution, is that big data access locality is weaker and cache space is insufficient. In this project, we proposed a big data caching structure oriented to the long-tail phenomenon, which improves the cold-file hit ratio by exploiting cold-file blocks that co-reside in hot files while still guaranteeing the hot-file hit ratio. We also proposed file classification based on attribute sets and similarity detection to realize distributed cache management. Locality-Sensitive Hashing was adopted to group similar files and narrow the query scope for duplicated data. Furthermore, a Dynamic Counting Bloom Filter Array was used to accelerate the detection of duplicated items, improving the performance of searching the full set of cold data in the long tail. This project broke with the conventional mindset in caching research, which focuses only on hot data, and turned attention to cold data of increasing volume and value, providing a new approach to big data caching.
Keywords (English): data caching; access pattern; big data; data deduplication; locality-sensitive hashing
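
Illustrative sketch (not part of the project record): the abstract names two techniques, locality-sensitive hashing to group similar files and a counting Bloom filter to speed up duplicate detection. The Python below is a minimal sketch of both ideas, not the project's implementation. It assumes MinHash with banding as the LSH scheme and a single fixed-size counting filter rather than a dynamic array of filters; the names `CountingBloomFilter`, `minhash_signature`, `lsh_groups` and all parameters are hypothetical.

```python
import hashlib
from collections import defaultdict


def _hash(data: bytes, seed: int) -> int:
    """Deterministic 64-bit hash of `data`, salted with `seed`."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


class CountingBloomFilter:
    """Bloom filter with integer counters, so items can also be removed."""

    def __init__(self, num_counters: int = 1024, num_hashes: int = 4):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, item: bytes):
        size = len(self.counters)
        return [_hash(item, seed) % size for seed in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: bytes) -> None:
        # Only decrement when the item appears present, to avoid negative counters.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item: bytes) -> bool:
        # May return a false positive, never a false negative.
        return all(self.counters[pos] > 0 for pos in self._positions(item))


def minhash_signature(blocks, num_perm: int = 16):
    """MinHash signature of a file, represented as a set of block fingerprints."""
    return tuple(min(_hash(b, seed) for b in blocks) for seed in range(num_perm))


def lsh_groups(files, bands: int = 4, num_perm: int = 16):
    """Group file names whose signatures agree on at least one band."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for name, blocks in files.items():
        sig = minhash_signature(blocks, num_perm)
        for band in range(bands):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(name)
    return [group for group in buckets.values() if len(group) > 1]


# Hypothetical files, each described by the fingerprints of its data blocks.
files = {
    "hot.dat": {b"blk1", b"blk2", b"blk3"},
    "cold.dat": {b"blk1", b"blk2", b"blk3"},
    "other.dat": {b"blkX", b"blkY"},
}
groups = lsh_groups(files)

cached_blocks = CountingBloomFilter()
cached_blocks.add(b"blk1")
```

In this sketch, files that land in the same LSH group are the only candidates compared for shared blocks, which narrows the duplicate search. The counting filter then answers "is this block possibly cached?" without a full lookup, and its counters let entries be withdrawn when a block is evicted.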