随机替代数据 (Data Deduplication with Random Substitutions) - 专知论文

会员服务 ·

0

Performer · 置换 · Storage · 流 · MoDELS ·

2021 年 7 月 1 日

Data Deduplication with Random Substitutions

翻译：随机替代数据

Hao Lou,Farzad Farnoud

Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time efficient and are thus widely used in large scale storage systems. In this paper, we provide an information-theoretic analysis on the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. More precisely, each symbol in a repeated string is substituted with a given edit probability. Deduplication algorithms in both the fixed-length scheme and the variable-length scheme are studied. The fixed-length deduplication algorithm is shown to be unsuitable for the proposed source model as it does not take into account the edit probability. Two modifications are proposed and shown to have performances within a constant factor of optimal with the knowledge of source model parameters. We also study the conventional variable-length deduplication algorithm and show that as source entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string, leading to high compression ratios.

翻译：数据解析通过辨别和删除数据流中的重复来节省存储空间。与传统的压缩方法相比, 数据解析方案更具有时间效率, 因而在大型储存系统中广泛使用。在本文中, 我们对数据流中的重复算法的性能进行了信息理论分析, 而重复并不准确。我们引入了一个参考概率替代的源模型。更准确地说, 重复字符串中的每个符号都被替换为给定的编辑概率。正在研究固定长度方案和可变长度方案中的解析算法。固定长度解算法被显示不适合拟议的源模型, 因为它没有考虑到编辑概率。提出两项修改, 并显示其性能与源模型参数知识的常数最佳性系数一致。我们还研究传统的变量解析算法, 并显示作为源的酶变小, 压缩字符串的消散大小与未压缩字符串长度相对, 导致高压缩比率。

0

相关内容

Performer

【经典书】线性代数元素，197页pdf

【经典书】线性代数元素，197页pdf

专知会员服务

56+阅读 · 2021年3月4日

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日

【干货书】机器学习速查手册，135页pdf

【干货书】机器学习速查手册，135页pdf

专知会员服务

127+阅读 · 2020年11月20日

Effective.Modern.C++ 中英文版，334页pdf

Effective.Modern.C++ 中英文版，334页pdf

专知会员服务

68+阅读 · 2020年11月4日

【Manning新书】现代Java实战，592页pdf

【Manning新书】现代Java实战，592页pdf

专知会员服务

101+阅读 · 2020年5月22日

Python分布式计算，171页pdf，Distributed Computing with Python

Python分布式计算，171页pdf，Distributed Computing with Python

专知会员服务

108+阅读 · 2020年5月3日

UC.Berkeley CS189讲义教材:《机器学习全面指南》，185页pdf

专知会员服务

162+阅读 · 2020年1月16日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

181+阅读 · 2019年10月11日

【新书】Python编程基础，669页pdf

【新书】Python编程基础，669页pdf

专知会员服务

196+阅读 · 2019年10月10日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

时序数据异常检测工具/数据集大列表

时序数据异常检测工具/数据集大列表

极市平台

65+阅读 · 2019年2月23日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

Ray RLlib: Scalable 降龙十八掌

Ray RLlib: Scalable 降龙十八掌

CreateAMind

9+阅读 · 2018年12月28日

Disentangled的假设的探讨

Disentangled的假设的探讨

CreateAMind

9+阅读 · 2018年12月10日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

Python机器学习教程资料/代码

Python机器学习教程资料/代码

机器学习研究会

8+阅读 · 2018年2月22日

【推荐】免费书(草稿)：数据科学的数学基础

【推荐】免费书(草稿)：数据科学的数学基础

机器学习研究会

20+阅读 · 2017年10月1日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

Autoencoders as Tools for Program Synthesis

Arxiv

0+阅读 · 2021年9月5日

On the proof complexity of MCSAT

Arxiv

0+阅读 · 2021年9月3日

Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data

Arxiv

0+阅读 · 2021年9月3日

New efficient time-stepping schemes for the anisotropic phase-field dendritic crystal growth model

Arxiv

0+阅读 · 2021年9月3日

A Support Vector Machine Based Cure Rate Model For Interval Censored Data

Arxiv

0+阅读 · 2021年9月2日

Querying multiple sets of $p$-values

Arxiv

0+阅读 · 2021年9月2日

Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Arxiv

4+阅读 · 2018年11月5日

Reinforcement Learning for Solving the Vehicle Routing Problem

Arxiv

3+阅读 · 2018年5月21日

Billion-scale Network Embedding with Iterative Random Projection

Arxiv

5+阅读 · 2018年5月7日

Subset Labeled LDA for Large-Scale Multi-Label Classification

Arxiv

3+阅读 · 2017年9月16日

VIP会员

文章信息

相关主题

相关VIP内容

【经典书】线性代数元素，197页pdf

【经典书】线性代数元素，197页pdf

专知会员服务

56+阅读 · 2021年3月4日

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日

【干货书】机器学习速查手册，135页pdf

【干货书】机器学习速查手册，135页pdf

专知会员服务

127+阅读 · 2020年11月20日

Effective.Modern.C++ 中英文版，334页pdf

Effective.Modern.C++ 中英文版，334页pdf

专知会员服务

68+阅读 · 2020年11月4日

【Manning新书】现代Java实战，592页pdf

【Manning新书】现代Java实战，592页pdf

专知会员服务

101+阅读 · 2020年5月22日

Python分布式计算，171页pdf，Distributed Computing with Python

Python分布式计算，171页pdf，Distributed Computing with Python

专知会员服务

108+阅读 · 2020年5月3日

UC.Berkeley CS189讲义教材:《机器学习全面指南》，185页pdf

专知会员服务

162+阅读 · 2020年1月16日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

181+阅读 · 2019年10月11日

【新书】Python编程基础，669页pdf

【新书】Python编程基础，669页pdf

专知会员服务

196+阅读 · 2019年10月10日

热门VIP内容

开通专知VIP会员享更多权益服务

《毁灭算法：解析以色列在加沙的AI军事行动》

【COLT 2025最新教程】语言生成

以机器速度锁定目标：人工智能的能力与局限

【ICML2025】通过在线世界模型规划的持续强化学习

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

时序数据异常检测工具/数据集大列表

时序数据异常检测工具/数据集大列表

极市平台

65+阅读 · 2019年2月23日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

Ray RLlib: Scalable 降龙十八掌

Ray RLlib: Scalable 降龙十八掌

CreateAMind

9+阅读 · 2018年12月28日

Disentangled的假设的探讨

Disentangled的假设的探讨

CreateAMind

9+阅读 · 2018年12月10日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

Python机器学习教程资料/代码

Python机器学习教程资料/代码

机器学习研究会

8+阅读 · 2018年2月22日

【推荐】免费书(草稿)：数据科学的数学基础

【推荐】免费书(草稿)：数据科学的数学基础

机器学习研究会

20+阅读 · 2017年10月1日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

相关论文

Autoencoders as Tools for Program Synthesis

Arxiv

0+阅读 · 2021年9月5日

On the proof complexity of MCSAT

Arxiv

0+阅读 · 2021年9月3日

Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data

Arxiv

0+阅读 · 2021年9月3日

New efficient time-stepping schemes for the anisotropic phase-field dendritic crystal growth model

Arxiv

0+阅读 · 2021年9月3日

A Support Vector Machine Based Cure Rate Model For Interval Censored Data

Arxiv

0+阅读 · 2021年9月2日

Querying multiple sets of $p$-values

Arxiv

0+阅读 · 2021年9月2日

Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Arxiv

4+阅读 · 2018年11月5日

Reinforcement Learning for Solving the Vehicle Routing Problem

Arxiv

3+阅读 · 2018年5月21日

Billion-scale Network Embedding with Iterative Random Projection

Arxiv

5+阅读 · 2018年5月7日

Subset Labeled LDA for Large-Scale Multi-Label Classification

Arxiv

3+阅读 · 2017年9月16日

微信扫码咨询专知VIP会员