Research in general-purpose lossless compression over the last decade has largely found improvements in compression ratio that come at great cost to resource utilization and processing throughput. However, most production workloads require high throughput and low resource utilization, so most research systems have seen little adoption. Instead, real-world improvements in compression are increasingly realized by building application-specific compressors that exploit knowledge about the structure and semantics of the data being compressed. These systems easily outperform even the best generic compressors, but application-specific compression schemes are not without drawbacks: they are inherently limited in applicability and are difficult to maintain and deploy. We show that these challenges can be overcome with a new way of thinking about compression. We propose the ``graph model'' of compression, a new theoretical framework for representing compression as a directed acyclic graph of modular codecs. This motivates OpenZL, an implementation of this model that compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder. OpenZL's design enables rapid development of tailored compressors with minimal code, its universal decoder eliminates deployment lag, and its investment in a well-vetted standard component library minimizes security risks. Experimental results demonstrate that OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors on a variety of real-world datasets. Internal deployments at Meta have also shown consistent improvements in size and/or speed, with development timelines reduced from months to days. OpenZL thus represents an advance in practical, scalable, and maintainable data compression for modern data-intensive applications.
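To make the idea of a self-describing format with a universal decoder concrete, the following is a minimal toy sketch in Python. It is not OpenZL's API or wire format. The codec registry `CODECS`, the one-byte codec ids, the header layout, and the functions `compress` and `decompress` are all hypothetical, and the "graph" is reduced to a linear chain of codecs, the simplest case of a directed acyclic graph.

```python
import zlib

def delta_encode(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)

def delta_decode(data: bytes) -> bytes:
    """Invert delta_encode by accumulating the differences (mod 256)."""
    prev = 0
    out = bytearray()
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)

# Hypothetical registry of modular codecs: id -> (encoder, decoder).
CODECS = {
    1: (delta_encode, delta_decode),
    2: (zlib.compress, zlib.decompress),
}

def compress(data: bytes, plan) -> bytes:
    """Apply the codecs named in `plan` in order and record the plan in the frame."""
    for codec_id in plan:
        data = CODECS[codec_id][0](data)
    # Assumed header layout: one byte for the stage count, then one byte per codec id.
    header = bytes([len(plan)]) + bytes(plan)
    return header + data

def decompress(frame: bytes) -> bytes:
    """Universal decoder: read the plan from the frame and undo it in reverse order."""
    n = frame[0]
    plan = frame[1:1 + n]
    data = frame[1 + n:]
    for codec_id in reversed(plan):
        data = CODECS[codec_id][1](data)
    return data
```

In this sketch the compressor can choose any plan, for example `compress(data, [1, 2])` for slowly varying numeric bytes or `compress(data, [2])` for text, and the same `decompress` function handles both because the plan travels with the data. That is the property the abstract attributes to a universal decoder: changing the compression configuration does not require shipping a new decoder.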