Large language models (LLMs) have demonstrated remarkable capabilities in numerous real-world applications. While experimentally driven research is progressing rapidly, it demands substantial computational power, data, and other resources; opening the black box of LLMs from a theoretical standpoint has therefore become a critical challenge. This paper takes the rate-distortion function, directed information, and Granger causality as its starting point to investigate the information-theoretic principles behind LLMs, leading to a semantic information theory for LLMs in which the fundamental unit is the token, rather than the bit, which lacks semantic meaning. By defining a probabilistic model of LLMs, we discuss structure-agnostic information-theoretic measures, such as the directed rate-distortion function in pre-training, the directed rate-reward function in post-training, and the semantic information flow in the inference phase. This paper also delves into the theory of token-level semantic embedding and an information-theoretically optimal vectorization method. Thereafter, we propose a general definition of autoregressive LLMs, under which the Transformer architecture and its performance characteristics, such as the evidence lower bound (ELBO), generalization error bounds, memory capacity, and semantic information measures, can be derived theoretically. Other architectures, such as Mamba/Mamba2 and LLaDA, are also discussed within our framework. Consequently, this paper provides a theoretical framework for understanding LLMs from the perspective of semantic information theory and offers the theoretical tools needed for further in-depth research.
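For concreteness, the classical quantities taken as the starting point admit the following standard forms (a minimal sketch using textbook definitions; the token-level directed variants developed in the paper may generalize them). Massey's directed information from a sequence $X^n$ to $Y^n$ is
\[
I(X^n \to Y^n) = \sum_{i=1}^{n} I\!\left(X^i;\, Y_i \mid Y^{i-1}\right),
\]
and Shannon's rate-distortion function for a source $X$, with reconstruction $\hat{X}$, distortion measure $d$, and distortion budget $D$, is
\[
R(D) = \min_{p(\hat{x}\mid x)\,:\; \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X}).
\]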