Traditional methods for data compression are typically based on symbol-level statistics, with the information source modeled as a long sequence of i.i.d. random variables or as a stochastic process, thus establishing the fundamental limit as entropy for lossless compression and as mutual information for lossy compression. However, real-world sources (including text, music, and speech) are often statistically ill-defined because of their close connection to human perception, and thus the model-driven approach can be quite suboptimal. This study focuses on English text and exploits its semantic aspect to further enhance compression efficiency. The main idea stems from crossword puzzles, in which hidden words can still be precisely reconstructed so long as some key letters are provided; the proposed masking-based strategy resembles this game. In a nutshell, the encoder evaluates the semantic importance of each word according to the semantic loss and then masks the minor ones, while the decoder aims to recover the masked words from the semantic context by means of a Transformer. Our experiments show that the proposed semantic approach can achieve much higher compression efficiency than traditional methods such as Huffman coding and UTF-8 encoding, while preserving the meaning of the target text to a great extent.
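The encoder side of this strategy can be sketched as follows. This is a minimal illustration, not the paper's implementation: the importance scores here are hypothetical stand-ins for the semantic-loss evaluation described above (a real system would derive them from a language model, and the decoder would use a Transformer to fill in the masks from context).

```python
# Sketch of the masking-based encoder: keep the most semantically
# important words, replace the rest with a mask token.
# The importance scores below are illustrative placeholders, not the
# paper's semantic-loss values.

MASK = "<m>"

def mask_minor_words(words, importance, keep_ratio=0.5):
    """Mask all but the top `keep_ratio` fraction of words by importance."""
    n_keep = max(1, round(len(words) * keep_ratio))
    # Indices of the words with the highest importance scores.
    keep = set(sorted(range(len(words)), key=lambda i: -importance[i])[:n_keep])
    return [w if i in keep else MASK for i, w in enumerate(words)]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
scores = [0.1, 0.9, 0.7, 0.2, 0.1, 0.8]  # hypothetical semantic importance
print(mask_minor_words(sentence, scores, keep_ratio=0.5))
# → ['<m>', 'cat', 'sat', '<m>', '<m>', 'mat']
```

Only the kept words (plus mask positions) need to be transmitted, which is where the rate saving comes from; the reconstruction quality then depends on how well the decoder's language model can infer the masked words.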