The ability to understand synonymous expressions is crucial for many downstream tasks: it enables a model to better capture the similarity between contexts and makes it more robust to synonym substitution attacks. However, many pretrained language models (PLMs) lack synonym knowledge because of the limited scale of existing synsets and the design of their pretraining objectives. In this paper, we propose Sem4SAP, a framework that mines synsets from an Open Knowledge Graph (Open-KG) and uses the mined synsets for synonym-aware pretraining of language models. We coarsely filter the content in the Open-KG and exploit frequency information to better guide the clustering process under low-resource, unsupervised conditions. We then expand the mined synsets by migrating core semantics between synonymous expressions. We also propose two novel and effective synonym-aware pretraining methods for injecting synonym knowledge into PLMs. Extensive experiments demonstrate that Sem4SAP substantially outperforms the original PLMs and other baselines on ten different tasks.
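To make the general idea of synonym-aware pretraining concrete, the sketch below shows one way mined synsets could be turned into substituted sentence variants that a PLM can be trained to treat as equivalent to the originals. This is a minimal, hypothetical illustration only: the toy synsets, the `substitute_synonyms` helper, and the substitution rate are assumptions for exposition and do not reflect the actual Sem4SAP pretraining objectives.

```python
import random
import re

# Toy synsets, as might be mined from an Open-KG (contents are hypothetical).
SYNSETS = [
    {"car", "automobile", "motorcar"},
    {"buy", "purchase"},
]

# Word -> its synset, for quick lookup during substitution.
WORD2SET = {w: s for s in SYNSETS for w in s}


def substitute_synonyms(sentence: str, rate: float = 0.3, seed: int = 0) -> str:
    """Replace some words with a synonym drawn from the same synset.

    The substituted sentence can be paired with the original so a PLM
    learns that the two surface forms express the same meaning.
    """
    rng = random.Random(seed)
    tokens = re.findall(r"\w+|\W+", sentence)  # keep punctuation and spaces
    out = []
    for tok in tokens:
        syns = WORD2SET.get(tok.lower())
        if syns and len(syns) > 1 and rng.random() < rate:
            out.append(rng.choice(sorted(syns - {tok.lower()})))
        else:
            out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    src = "I want to buy a car."
    print(src, "->", substitute_synonyms(src, rate=1.0))
```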