Over the last decade, research techniques have improved lossless compression ratios at the cost of significantly increased processing time. These techniques have remained obscure because production systems require high throughput and low resource utilization. In practice, application-specific compression algorithms that leverage knowledge of the data's structure and semantics are more popular. Application-specific compressors outperform even the best generic compressors, but they have drawbacks: they are inherently limited in applicability, have high development costs, and are difficult to maintain and deploy. In this work, we show that these challenges can be overcome with a new compression strategy. We propose the "graph model" of compression, a new theoretical framework for representing compression as a directed acyclic graph of modular codecs. OpenZL, an implementation of this model, compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder. OpenZL's design enables rapid development of tailored compressors with minimal code; its universal decoder eliminates deployment lag; and its investment in a well-vetted standard component library minimizes security risks. Experimental results demonstrate that OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors on a variety of real-world datasets. Internal deployments at Meta have also shown consistent improvements in size and/or speed, with development timelines reduced from months to days. OpenZL thus represents a significant advance in practical, scalable, and maintainable data compression for modern data-intensive applications.
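To make the idea of a self-describing format with a universal decoder concrete, the following is a minimal sketch in Python. It is not OpenZL's API or wire format: the codec IDs, the one-byte header layout, and the function names are all hypothetical, and the graph is simplified to a linear chain (the simplest directed acyclic graph). The point it illustrates is that the frame records which codecs were applied, so a single decoder can invert any configuration without being redeployed.

```python
import zlib
from typing import Sequence


def delta_encode(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)


def delta_decode(data: bytes) -> bytes:
    """Invert delta_encode by accumulating the differences (mod 256)."""
    prev = 0
    out = bytearray()
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)


# Hypothetical shared codec library: id -> (encode, decode).
# Both the compressor and the universal decoder know this table.
CODECS = {
    1: (delta_encode, delta_decode),
    2: (zlib.compress, zlib.decompress),
}


def compress(data: bytes, chain: Sequence[int]) -> bytes:
    """Apply the codecs in `chain` in order and prepend a header describing them.

    Assumed toy frame layout: [number of codecs][codec ids...][payload].
    """
    payload = data
    for codec_id in chain:
        payload = CODECS[codec_id][0](payload)
    header = bytes([len(chain)]) + bytes(chain)
    return header + payload


def decompress(frame: bytes) -> bytes:
    """Universal decoder: read the codec chain from the frame and undo it in reverse."""
    count = frame[0]
    chain = frame[1:1 + count]
    payload = frame[1 + count:]
    for codec_id in reversed(chain):
        payload = CODECS[codec_id][1](payload)
    return payload


if __name__ == "__main__":
    # Slowly increasing bytes: a delta stage exposes the regularity to zlib.
    data = bytes(range(256)) * 4
    for chain in ([], [2], [1, 2]):
        frame = compress(data, chain)
        assert decompress(frame) == data
        print(chain, len(frame))
```

In this sketch, changing the compression configuration only changes the `chain` argument on the compressing side; `decompress` is unchanged because it reads the configuration from the frame itself. A real graph model would also allow codecs with multiple outputs, which a linear chain cannot express.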