Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from anisotropy, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction and financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing both global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines on term discrimination tasks in the finance and legal domains.
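The multi-level objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the InfoNCE-style loss, the temperature, and the weighting factor `lam` that balances the sentence-level and token-level terms are all assumptions made for illustration.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss for one anchor embedding.

    Pulls the anchor toward its positive and pushes it away from
    negatives: -log( exp(sim(a,p)/t) / sum_j exp(sim(a,x_j)/t) ).
    The temperature value is a common default, not from the paper.
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Logit 0 is the positive pair; the rest are negatives.
    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability before exp
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[0]))

def multi_level_loss(sent_a, sent_p, sent_negs,
                     tok_a, tok_p, tok_negs, lam=0.5):
    """Hypothetical combination of the two levels: a sentence-level term
    over pooled sentence embeddings plus a token-level term over the
    embeddings of the terminology tokens themselves, weighted by lam."""
    return (info_nce(sent_a, sent_p, sent_negs)
            + lam * info_nce(tok_a, tok_p, tok_negs))
```

In this sketch, sentence-level inputs would be pooled embeddings of graph-derived positive/negative sentences, while token-level inputs would be contextual embeddings of the target term versus confusable terms; the combined loss is low only when both the global context and the fine-grained term representation are well separated.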