Lossy compression plays a growing role in scientific simulations, where the output data to be stored can span terabytes. Error-bounded lossy compression reduces the storage required for each simulation; however, there is no known upper bound on lossy compressibility. Correlation structures in the data, the choice of compressor, and the error bound are factors that enable larger compression ratios and improved quality metrics. Analyzing these three factors offers one direction towards quantifying lossy compressibility. As a first step, we explore statistical methods to characterize the correlation structures present in the data and to relate them, through functional models, to compression ratios. We observed a relationship between compression ratios and statistics summarizing the correlation structure of the data, a first step towards evaluating the theoretical limits of lossy compressibility, with the eventual goal of predicting compression performance and adapting compressors to the correlation structures present in the data.
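A minimal sketch of the kind of analysis described above, not the authors' method: it summarizes the correlation structure of a 2-D field with a lag-1 autocorrelation statistic and fits a simple functional model relating that statistic to compression ratios. The field generator, the statistic, and the compression ratios below are illustrative assumptions; in practice the ratios would come from an error-bounded compressor such as SZ or ZFP at a fixed error bound.

```python
import numpy as np

def lag1_autocorrelation(field: np.ndarray) -> float:
    """Average lag-1 autocorrelation along both axes of a 2-D field."""
    x = field - field.mean()
    denom = (x * x).sum()
    rho_rows = (x[:, :-1] * x[:, 1:]).sum() / denom
    rho_cols = (x[:-1, :] * x[1:, :]).sum() / denom
    return float(0.5 * (rho_rows + rho_cols))

def smooth_field(n: int, passes: int, rng: np.random.Generator) -> np.ndarray:
    """Synthetic field whose correlation length grows with `passes`."""
    f = rng.standard_normal((n, n))
    for _ in range(passes):
        f = 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0)
                    + np.roll(f, 1, 1) + np.roll(f, -1, 1))
    return f

rng = np.random.default_rng(0)
fields = [smooth_field(256, p, rng) for p in (0, 2, 8, 32)]
rho = np.array([lag1_autocorrelation(f) for f in fields])

# Hypothetical compression ratios for the four fields (assumed values,
# standing in for measurements from an error-bounded lossy compressor).
ratios = np.array([2.1, 4.7, 9.3, 18.5])

# Functional model: log(compression ratio) as a linear function of the
# correlation statistic, fit by least squares.
slope, intercept = np.polyfit(rho, np.log(ratios), deg=1)
print("model: log(ratio) ~ %.3f * rho + %.3f" % (slope, intercept))
```

The choice of statistic (lag-1 autocorrelation) and model form (log-linear) is only one possibility; the abstract's point is that such summary statistics carry information about achievable compression ratios.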