Scalable Hybrid Learning Techniques for Scientific Data Compression

Data compression is becoming critical for storing scientific data because many scientific applications need to store large amounts of data and post-process it for scientific discovery. Unlike image and video compression algorithms, which bound errors only on the primary data, scientific compression must also accurately preserve derived quantities of interest (QoIs). This paper presents a physics-informed compression technique, implemented as an end-to-end, scalable, GPU-based pipeline, that addresses this requirement. Our hybrid technique combines machine learning with standard compression methods. Specifically, we combine an autoencoder, an error-bounded lossy compressor that provides guarantees on the raw data error, and a constraint-satisfaction post-processing step that preserves the QoIs to within a minimal error (generally less than floating-point error). We demonstrate the effectiveness of the pipeline by compressing nuclear fusion simulation data generated by XGC, a large-scale fusion code that produces hundreds of terabytes of data in a single day. Our approach works within the ADIOS framework and achieves compression by a factor of more than 150 while requiring only a few percent of the computational resources needed to generate the data, making it highly effective in practical scenarios.
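The three stages named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: a rank-k PCA stands in for the autoencoder, uniform quantization of the residual stands in for the error-bounded lossy compressor, and a minimum-norm projection onto linear constraints stands in for the constraint-satisfaction step. The function names, the toy data, and the two linear QoIs (a sum and a weighted sum) are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Stand-in for the autoencoder (assumption): a rank-k PCA basis."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def compress(X, mean, basis, eps):
    """Encode to a latent, then quantize the residual with bin width 2*eps."""
    latent = (X - mean) @ basis.T
    approx = latent @ basis + mean
    # Rounding to the nearest bin keeps each residual entry within eps.
    codes = np.round((X - approx) / (2 * eps)).astype(np.int64)
    return latent, codes

def decompress(latent, codes, mean, basis, eps):
    """Decode the latent and add back the dequantized residual."""
    return latent @ basis + mean + codes * (2 * eps)

def project_onto_qoi(X_hat, A, q):
    """Minimum-norm correction so that A @ x equals q for every row x."""
    gap = X_hat @ A.T - q                          # QoI violation, shape (n, m)
    corr = np.linalg.solve(A @ A.T, gap.T).T @ A   # shape (n, d)
    return X_hat - corr

# Toy data (hypothetical): 200 samples of dimension 16 with low-rank structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 16))
X += 0.05 * rng.normal(size=X.shape)

# Two hypothetical linear QoIs per sample: a sum and a weighted sum.
A = np.vstack([np.ones(16), np.linspace(-1.0, 1.0, 16)])
q = X @ A.T   # exact QoIs, stored alongside the compressed data

eps = 1e-2
mean, basis = fit_linear_autoencoder(X, k=4)
latent, codes = compress(X, mean, basis, eps)
X_hat = decompress(latent, codes, mean, basis, eps)
X_fixed = project_onto_qoi(X_hat, A, q)

print("max raw error:          ", np.abs(X - X_hat).max())
print("max QoI error before fix:", np.abs(X_hat @ A.T - q).max())
print("max QoI error after fix: ", np.abs(X_fixed @ A.T - q).max())
```

In this sketch the quantized residual guarantees a pointwise bound of `eps` on `X_hat`, and the projection then drives the QoI error down to roundoff level. The projection moves each sample slightly, so the pointwise bound on `X_fixed` is not guaranteed to remain exactly `eps`; a real pipeline would also need to entropy-code `codes`, compress `latent`, and account for the cost of storing `q`.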
