Error-bounded lossy compression is becoming increasingly important to today's extreme-scale HPC applications because of the ever-increasing volume of data they generate. It has been widely used for in-situ visualization, data stream intensity reduction, storage reduction, I/O performance improvement, checkpoint/restart acceleration, and memory footprint reduction, among other purposes. Although many works have optimized the compression ratio, quality, and performance of different error-bounded lossy compressors, no existing work attempts to systematically understand the impact of lossy compression errors on HPC applications due to error propagation. In this paper, we propose and develop a lossy compression fault injection tool called LCFI. To the best of our knowledge, this is the first fault injection tool that helps both lossy compressor developers and users systematically and comprehensively understand the impact of lossy compression errors on HPC programs. The contributions of this work are threefold: (1) We propose an efficient approach to injecting lossy compression errors based on a statistical analysis of the compression errors of different state-of-the-art compressors. (2) We build a fault injector that is highly applicable, customizable, and easy to use in generating top-down comprehensive results, and we demonstrate the use of LCFI. (3) We evaluate LCFI on four representative HPC benchmarks with different abstracted fault models and make several observations about error propagation and its impact on program outputs.