Contemporary large language models (LLMs) predominantly rely on next-token prediction for inference, which significantly limits their processing speed. In this paper, we introduce a novel inference methodology termed next-sentence prediction, aimed at improving the inference efficiency of LLMs. We present SentenceVAE, a tiny model consisting of an encoder and a decoder: the encoder condenses the information in a sentence into a single token, while the decoder reconstructs this compressed representation back into the original sentence. By integrating SentenceVAE into the input and output layers of LLMs, we develop Sentence-level LLMs (SLLMs) that infer sentence by sentence, markedly accelerating inference. Because SentenceVAE segments text at sentence boundaries, it preserves the original semantic content, maintaining accuracy while boosting inference speed. Compared with traditional LLMs, SLLMs process fewer tokens over equivalent context lengths, significantly reducing the memory required for self-attention computation and facilitating the handling of longer contexts. Our experiments show that this method can increase inference speed by 204–365%, reduce perplexity (PPL) to 46–75% of its original value, and decrease memory overhead by 86–91% at the same context length. These advantages are further amplified as the number of model parameters increases.
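To make the encoder/decoder pipeline concrete, below is a minimal PyTorch sketch of how a sentence could be compressed into a single token and then reconstructed. The class name `SentenceVAE` matches the paper, but every layer choice, the mean-pooling step, and all hyperparameters here are illustrative assumptions, not the authors' implementation; any variational sampling step is also omitted for brevity.

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    """Sketch of a sentence encoder/decoder pair: the encoder compresses
    the word tokens of one sentence into a single sentence-level vector,
    and the decoder autoregressively reconstructs the word tokens from it.
    All architectural choices below are assumptions for illustration."""

    def __init__(self, vocab_size: int, d_model: int = 512,
                 n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode(self, tokens: torch.Tensor) -> torch.Tensor:
        """(batch, seq) word ids -> (batch, 1, d_model) sentence token."""
        h = self.encoder(self.embed(tokens))
        # Mean pooling is the simplest way to collapse a sentence into
        # one vector; the paper may pool differently (an assumption).
        return h.mean(dim=1, keepdim=True)

    def decode(self, sentence_token: torch.Tensor,
               prev_tokens: torch.Tensor) -> torch.Tensor:
        """Next-word logits given the sentence token and the words decoded
        so far (teacher forcing during reconstruction training)."""
        tgt = self.embed(prev_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        h = self.decoder(tgt, memory=sentence_token, tgt_mask=mask)
        return self.lm_head(h)

# Usage sketch: one compressed token per sentence, then reconstruction.
vae = SentenceVAE(vocab_size=32000)
words = torch.randint(0, 32000, (4, 12))   # a batch of 12-word sentences
z = vae.encode(words)                      # (4, 1, 512): one token each
logits = vae.decode(z, words[:, :-1])      # logits for reconstructing words[:, 1:]
```

In an SLLM, the base model would then attend over these sentence-level tokens rather than word-level ones, which is where the self-attention memory savings and the sentence-by-sentence speedup described in the abstract come from.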