Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, including retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The typical GPTE architecture leverages a PLM to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, namely embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. We then describe the advanced roles enabled by PLMs, including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
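To make the encode-and-contrast recipe above concrete, here is a minimal sketch assuming a Hugging Face `transformers` encoder, mean pooling, and the widely used InfoNCE objective with in-batch negatives; the model choice, pooling strategy, and all function names are illustrative assumptions, not a reference implementation from the survey.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative choices: any PLM encoder and any pooling strategy
# (CLS, mean, last-token) can fill these roles in a GPTE pipeline.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Derive dense text representations by mean-pooling PLM outputs."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)         # mean over real tokens
    return F.normalize(pooled, dim=-1)                    # unit-norm embeddings

def info_nce_loss(queries, positives, temperature=0.05):
    """Contrastive (InfoNCE) loss on paired data with in-batch negatives."""
    q, p = embed(queries), embed(positives)
    logits = q @ p.T / temperature     # cosine-similarity matrix, (B, B)
    labels = torch.arange(q.size(0))   # the i-th query matches the i-th positive
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(
    ["what is a text embedding?", "capital of France"],
    ["A dense vector representation of text.", "Paris is the capital of France."],
)
loss.backward()
```

In practice, GPTE systems vary the encoder, the pooling strategy, and the source of negatives (hard negatives, cross-batch negatives), but this paired contrastive setup is the common core the survey builds on.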