Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. With ever-emerging heterogeneous HPC architectures, GPU-accelerated error-bounded compressors (such as cuSZ and cuZFP) have been developed. However, they suffer from either low performance or low compression ratios. To this end, we propose cuSZ(x), which targets both high compression ratios and high throughput. We identify that data sparsity and data smoothness are key factors for high compression throughput. Our key contributions in this work are fourfold: (1) We propose an efficient compression workflow that adaptively performs run-length encoding and/or variable-length encoding. (2) We formulate the Lorenzo reconstruction in decompression as a multidimensional partial-sum computation and propose a fine-grained Lorenzo reconstruction algorithm for GPU architectures. (3) We carefully optimize each of cuSZ(x)'s kernels by leveraging state-of-the-art CUDA parallel primitives. (4) We evaluate cuSZ(x) using seven real-world HPC application datasets on V100 and A100 GPUs. Experiments show that cuSZ(x) improves the compression throughput and compression ratio by up to 18.4$\times$ and 5.3$\times$, respectively, over cuSZ on the tested datasets.
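To illustrate the partial-sum formulation behind contribution (2), below is a minimal sketch (not cuSZ(x)'s actual kernel; the error bound eb and the quantization codes q are illustrative assumptions) showing why 1-D Lorenzo reconstruction reduces to a prefix sum: the serial recurrence x'[i] = x'[i-1] + 2*eb*q[i] unrolls to x'[i] = 2*eb * sum_{k<=i} q[k], which a GPU scan primitive such as Thrust's inclusive_scan computes in parallel.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>

// Map an integer partial sum back to a reconstructed value: x = 2*eb*sum.
struct DequantScale {
    float twice_eb;
    __host__ __device__ float operator()(int s) const { return twice_eb * s; }
};

int main() {
    const float eb = 0.01f;  // user-specified absolute error bound (assumed)

    // Quantization codes q[i] = round((x[i] - pred[i]) / (2*eb)) as a
    // 1-D Lorenzo compressor would emit them (illustrative values only).
    std::vector<int> h_q = {100, -2, 3, 0, 1, -4};
    thrust::device_vector<int> q(h_q.begin(), h_q.end());
    thrust::device_vector<int> partial(q.size());

    // The loop-carried recurrence x'[i] = x'[i-1] + 2*eb*q[i] becomes a
    // single parallel inclusive scan over the quantization codes.
    thrust::inclusive_scan(q.begin(), q.end(), partial.begin());

    // Scale the integer partial sums back into the data domain.
    thrust::device_vector<float> x(q.size());
    thrust::transform(partial.begin(), partial.end(), x.begin(),
                      DequantScale{2.0f * eb});

    for (size_t i = 0; i < x.size(); ++i)
        printf("x[%zu] = %.4f\n", i, static_cast<float>(x[i]));
    return 0;
}
```

In higher dimensions, the same partial-sum view factorizes into independent 1-D scans along each axis (e.g., row-wise scans followed by column-wise scans in 2-D), which is what makes a fine-grained GPU mapping possible.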