More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, and cuSZ is a version of SZ designed to take advantage of the GPU's power. At present, cuSZ's compression performance has been optimized significantly, while its decompression still suffers from considerably lower performance because of its sophisticated lossless compression step -- a customized Huffman decoding. In this work, we aim to significantly improve the Huffman decoding performance for cuSZ, thus improving its overall decompression performance in turn. To this end, we first investigate two state-of-the-art GPU Huffman decoders in depth. Then, we propose deep architectural optimizations for both algorithms. Specifically, we take full advantage of CUDA GPU architectures by using shared memory in the decoding/writing phases, tuning the amount of shared memory to use online, improving memory access patterns, and reducing warp divergence. Finally, we evaluate our optimized decoders on an Nvidia V100 GPU using eight representative scientific datasets. Our new decoding solution obtains an average speedup of 3.64X over cuSZ's Huffman decoder and improves its overall decompression performance by 2.43X on average.
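To make the shared-memory and coalescing ideas above concrete, the following is a minimal CUDA sketch, not cuSZ's actual kernel: each thread "decodes" a fixed-size chunk into dynamically sized shared memory, and the block then flushes the staged symbols to global memory with coalesced writes; the dynamic shared-memory size is chosen at launch time, standing in for the online tuning step. All identifiers (decode_chunk, decode_with_smem, SYMS_PER_THREAD) are assumptions for illustration, and the per-chunk decoder is a placeholder copy rather than real Huffman decoding.

```cuda
// Hypothetical sketch: stage per-thread decoded symbols in shared memory,
// then flush them to global memory with coalesced writes.
#include <cstdint>
#include <cuda_runtime.h>

constexpr int SYMS_PER_THREAD = 4;  // assumed chunk size per thread

// Placeholder per-thread decoder: a real Huffman decoder would walk the
// bitstream from a per-chunk entry point using the codebook; here we copy.
__device__ int decode_chunk(const uint16_t* in, uint16_t* out, int n)
{
    for (int i = 0; i < n; ++i) out[i] = in[i];
    return n;  // number of symbols produced
}

__global__ void decode_with_smem(const uint16_t* in, uint16_t* out, int total)
{
    extern __shared__ uint16_t smem[];  // size chosen at launch time ("online tuning")
    const int tid       = threadIdx.x;
    const int base      = (blockIdx.x * blockDim.x + tid) * SYMS_PER_THREAD;
    const int blockBase = blockIdx.x * blockDim.x * SYMS_PER_THREAD;

    // Decode this thread's chunk into registers, then stage it in shared memory.
    uint16_t local[SYMS_PER_THREAD];
    int n = 0;
    if (base < total)
        n = decode_chunk(in + base, local, min(SYMS_PER_THREAD, total - base));
    for (int i = 0; i < n; ++i)
        smem[tid * SYMS_PER_THREAD + i] = local[i];
    __syncthreads();

    // Coalesced flush: thread t writes elements t, t+blockDim.x, t+2*blockDim.x, ...
    const int blockSyms = blockDim.x * SYMS_PER_THREAD;
    for (int i = tid; i < blockSyms && blockBase + i < total; i += blockDim.x)
        out[blockBase + i] = smem[i];
}

int main()
{
    const int total   = 1 << 20;
    const int threads = 256;
    const int blocks  = (total + threads * SYMS_PER_THREAD - 1) / (threads * SYMS_PER_THREAD);

    uint16_t *d_in, *d_out;
    cudaMalloc((void**)&d_in,  total * sizeof(uint16_t));
    cudaMalloc((void**)&d_out, total * sizeof(uint16_t));
    cudaMemset(d_in, 0, total * sizeof(uint16_t));

    // Pick the dynamic shared-memory size at run time, e.g. from the chunk
    // configuration and the per-SM shared-memory budget.
    size_t smemBytes = threads * SYMS_PER_THREAD * sizeof(uint16_t);
    decode_with_smem<<<blocks, threads, smemBytes>>>(d_in, d_out, total);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```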