Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies have introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as masked next-token prediction (MNTP) as used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for the unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute for the full context in downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data.
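To make the compression pretext task concrete, the following is a minimal, hypothetical sketch of the general idea: a causal model encodes the context together with a few learnable memory tokens, and the continuation is then predicted from the memory-token hidden states alone. All names, shapes, and the toy `TinyCausalLM` backbone are illustrative assumptions, not the paper's actual architecture or training recipe.

```python
# Hypothetical sketch of a context-compression pretext objective (not the paper's exact setup).
# A causal model reads [context ; k memory tokens]; the memory-token hidden states are then
# the ONLY conditioning available when predicting the continuation.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, n_mem = 1000, 64, 4  # toy sizes, chosen for illustration

class TinyCausalLM(nn.Module):
    """Tiny stand-in for a pretrained causal LLM backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.randn(1, 64, d_model) * 0.02)  # toy positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, inputs_embeds):
        T = inputs_embeds.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(inputs_embeds + self.pos[:, :T], mask=causal)
        return h, self.lm_head(h)

model = TinyCausalLM()
mem_embed = nn.Parameter(torch.randn(1, n_mem, d_model) * 0.02)  # learnable memory tokens

context = torch.randint(0, vocab_size, (2, 16))       # text to be compressed
continuation = torch.randint(0, vocab_size, (2, 8))   # text the memory must let us predict

# 1) Compress: encode [context ; memory tokens], keep only the memory hidden states.
ctx_in = torch.cat([model.embed(context), mem_embed.expand(2, -1, -1)], dim=1)
h, _ = model(ctx_in)
memory = h[:, -n_mem:, :]                              # compact substitute for the context

# 2) Reconstruct: predict the continuation conditioned on the memory alone (context dropped).
dec_in = torch.cat([memory, model.embed(continuation[:, :-1])], dim=1)
_, logits = model(dec_in)
logits = logits[:, n_mem - 1:, :]                      # positions that predict continuation tokens
loss = F.cross_entropy(logits.reshape(-1, vocab_size), continuation.reshape(-1))
loss.backward()
```

Under this kind of objective, the memory tokens are useful only if they summarize the whole context, which is what makes their hidden states a natural starting point for a text representation before any contrastive fine-tuning.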