Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: $(i)$ Estimate latent variables associated to rows and columns; $(ii)$ Partition the table in blocks according to the row/column latents; $(iii)$ Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; $(iv)$ Append a compressed encoding of the latents. We evaluate it on several benchmark datasets, and study optimal compression in a probabilistic model for that tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.
翻译:用于分析和机器学习的数据通常采取带有绝对条目的表格形式。我们采用一系列无损压缩算法,用于分四个步骤进行的数据:美元(一)美元(估计与行和列相关的潜在变量);美元(二)美元(一)按行/柱潜值将表格按区块分隔;美元(三)美元(三)对每个区块应用顺序(例如Lempel-Ziv)编码器;美元(四)美元(四)追加对潜值的压缩编码。我们评估了几个基准数据集,并研究了该表数据概率模型中的最佳压缩法,根据这一模型,潜值是独立的,表条目是有条件独立的;我们证明该模型有一个明确界定的英特比率,并满足了一种无符号装备属性的属性。我们还证明Lempel-Ziv和有限状态编码器等古典缩缩压缩方法没有达到这一速率。另一方面,上文概述的潜值估计战略达到了最佳速率。