Research on Data Caching Technology for the Long-Tail Phenomenon - Zhuanzhi (专知) Funds

December 31, 2015

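National Natural Science Foundation of China (国家自然科学基金)

National Natural Science Foundation of China Committee (国家自然科学基金委员会)

Project title: Research on Data Caching Technology for the Long-Tail Phenomenon

Project number: No.61502189

Project type: Young Scientists Fund project

Approval year: 2016

Discipline: Automation technology; computer technology

Project author: 王桦 (Wang Hua)

Institution: Huazhong University of Science and Technology (华中科技大学)

Funding: CNY 200,000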

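Abstract (translated from Chinese): The access pattern of big data has shifted from the traditional Zipf distribution to a stretched exponential (SE) distribution, and traditional data caching techniques no longer suit big data access. The long tail of the SE distribution, and a caching efficiency far below that achieved under a Zipf distribution, both stem from weakened access locality and insufficient cache space. This project proposes a big data cache structure oriented to the long-tail phenomenon: by mining the cold-file data blocks contained in hot files, it raises the cache hit ratio of cold files while preserving the hit ratio of hot files. It proposes a file classification method based on attribute sets and similarity detection to implement distributed cache management, using locality-sensitive hashing to group files and narrow the search scope for duplicate data. It further uses a dynamic counting Bloom filter array to speed up duplicate detection, improving the performance of cache lookups over the full set of long-tail cold data. The project breaks from the convention that caching research targets only hot data and focuses instead on cold data under the SE distribution, whose scale and value keep growing, offering a new approach to big data cache design.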
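
For orientation, the two popularity models named in the abstract are commonly written in rank form as below. This is a sketch of the usual textbook forms, not a formula taken from the project record; the symbols $y_i$ (number of accesses to the object of popularity rank $i$), $\alpha$, $a$, $b$, $c$ and $C$ are notational assumptions introduced here.

```latex
% Zipf-like distribution: a power law in the rank i
y_i = \frac{C}{i^{\alpha}}
\quad\Longleftrightarrow\quad
\log y_i = -\alpha \log i + \log C

% Stretched exponential (SE) distribution with stretch factor 0 < c < 1,
% obtained from the tail P(X \ge x) = \exp\!\left(-(x/x_0)^{c}\right)
y_i^{\,c} = -a \log i + b,
\qquad a = x_0^{\,c}
```

Under these forms, a Zipf-like workload plots as a straight line on log-log axes, whereas an SE workload plots as a straight line only when the horizontal axis is logarithmic and the vertical axis is rescaled to $y^c$.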

Keywords (translated from Chinese): data caching; access pattern; big data; data deduplication; locality-sensitive hashing

English abstract: The access pattern of big data has changed from the traditional Zipf-like distribution to a stretched exponential (SE) distribution, so conventional caching approaches are no longer suitable for big data access. The root cause of the SE distribution's long tail, and of its lower caching efficiency compared with a Zipf-like distribution, is that big data access locality is weaker and cache space is insufficient. In this project, we propose a big data caching structure oriented to the long-tail phenomenon, which improves the cold-file hit ratio by exploiting cold-file blocks that co-reside in hot files while preserving the hit ratio of hot files. We also propose file classification based on attribute sets and similarity detection to realize distributed cache management; locality-sensitive hashing is adopted to group similar files and narrow the query scope for duplicated data. Furthermore, a dynamic counting Bloom filter array is used to accelerate duplicate detection, improving the performance of searching the full set of cold data in the long tail. The project breaks with the conventional focus of caching research on hot data alone and turns attention to cold data of increasing volume and value, providing a new approach to big data caching.
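
The abstract does not say which locality-sensitive hashing scheme the project uses. As one possible illustration of grouping similar files so that duplicate lookups stay within a small candidate set, the sketch below uses MinHash with banding, a common LSH construction for set similarity. The function names, the byte-shingle features, and the parameters `size`, `bands` and `rows` are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def shingles(data: bytes, size: int = 4) -> set:
    """Overlapping byte windows used as the file's feature set (an assumption)."""
    return {data[i:i + size] for i in range(max(1, len(data) - size + 1))}


def minhash_signature(features: set, num_hashes: int = 16) -> tuple:
    """Keep the minimum hash value per seeded hash function.

    Two sets agree on a given position with probability equal to their
    Jaccard similarity, so similar files share many signature positions.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(salt + f, digest_size=8).digest(), "big")
            for f in features
        ))
    return tuple(signature)


def lsh_groups(files: dict, bands: int = 8, rows: int = 2) -> dict:
    """Bucket files by each band of their signature; a bucket is a candidate group."""
    buckets = defaultdict(set)
    for name, data in files.items():
        sig = minhash_signature(shingles(data), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    return buckets


def candidate_pairs(buckets: dict) -> set:
    """File pairs sharing at least one bucket; only these need a duplicate check."""
    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs
```

Calling `candidate_pairs(lsh_groups(files))` on a mapping of file names to contents returns the pairs worth comparing block by block. Files with identical content always land in the same buckets; near-duplicates do so with a probability that grows with their similarity and with the number of bands.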

English keywords: data caching; access pattern; big data; data deduplication; locality-sensitive hashing
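
The abstract mentions a dynamic counting Bloom filter array for fast duplicate detection but gives no design details. The following is a minimal sketch of the general idea only: a counting Bloom filter (counters instead of bits, so entries can be removed) plus a hypothetical array wrapper that appends a fresh filter when the current one reaches a chosen load. The class names, the sizes, and the growth rule are assumptions for illustration and are not the project's design.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with integer counters so that items can also be removed."""

    def __init__(self, num_counters: int = 1024, num_hashes: int = 4):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, item: bytes):
        # Derive one counter index per seeded hash of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                seed.to_bytes(4, "big") + item, digest_size=8
            ).digest()
            yield int.from_bytes(digest, "big") % len(self.counters)

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: bytes) -> None:
        # Only decrement when the item appears present; removing an item that
        # was never added (a false positive) would corrupt other entries.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item: bytes) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


class CountingBloomFilterArray:
    """Hypothetical growable array: a new filter is appended when the last is full."""

    def __init__(self, capacity_per_filter: int = 100):
        self.capacity = capacity_per_filter
        self.filters = [CountingBloomFilter()]
        self.loads = [0]  # number of items added to each filter

    def add(self, item: bytes) -> None:
        if self.loads[-1] >= self.capacity:
            self.filters.append(CountingBloomFilter())
            self.loads.append(0)
        self.filters[-1].add(item)
        self.loads[-1] += 1

    def __contains__(self, item: bytes) -> bool:
        # A "no" from every filter is definite; a "yes" may be a false positive.
        return any(item in f for f in self.filters)
```

A membership test such as `fingerprint in array` answers "definitely not cached" or "possibly cached", so a cache would still confirm a positive answer against the stored block before treating it as a duplicate.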
