Recently, contrastive learning has been shown to be effective in improving pre-trained language models (PLMs) to derive high-quality sentence representations. It aims to pull positive examples closer to enhance alignment, while pushing apart irrelevant negatives to improve the uniformity of the whole representation space. However, previous works mostly adopt in-batch negatives or sample negatives at random from the training data. Such a strategy may cause sampling bias, where improper negatives (e.g., false negatives and anisotropic representations) are used to learn sentence representations, which hurts the uniformity of the representation space. To address this, we present a new framework, \textbf{DCLR} (\underline{D}ebiased \underline{C}ontrastive \underline{L}earning of unsupervised sentence \underline{R}epresentations), to alleviate the influence of these improper negatives. In DCLR, we design an instance weighting method to penalize false negatives, and generate noise-based negatives to guarantee the uniformity of the representation space. Experiments on seven semantic textual similarity tasks show that our approach is more effective than competitive baselines. Our code and data are publicly available at \textcolor{blue}{\url{https://github.com/RUCAIBox/DCLR}}.
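To make the idea concrete, the following is a minimal sketch (not DCLR's actual implementation) of a debiased contrastive objective: suspected false negatives, identified here by a simple cosine-similarity threshold \texttt{phi}, receive zero instance weight, and random noise vectors are appended as extra negatives to encourage uniformity. All names, the threshold heuristic, and hyperparameter values are illustrative assumptions, not the paper's API.

```python
import numpy as np

def debiased_contrastive_loss(anchor, positive, negatives,
                              tau=0.05, phi=0.9, k=4, rng=None):
    """Illustrative debiased InfoNCE-style loss (hypothetical sketch).

    - Negatives with cosine similarity to the anchor above `phi` are
      treated as likely false negatives and weighted 0.
    - `k` Gaussian noise vectors are added as extra negatives to push
      the representation space toward uniformity.
    """
    rng = rng or np.random.default_rng(0)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # noise-based negatives (assumption: plain Gaussian samples)
    noise = rng.standard_normal((k, anchor.shape[0]))
    all_negs = np.vstack([negatives, noise])

    sims = np.array([cos(anchor, n) for n in all_negs])
    # instance weights: zero out suspected false negatives
    weights = (sims < phi).astype(float)

    pos_term = np.exp(cos(anchor, positive) / tau)
    neg_term = np.sum(weights * np.exp(sims / tau))
    # standard InfoNCE form; loss is always > 0 since neg_term > 0
    return -np.log(pos_term / (pos_term + neg_term))
```

In the actual method, the weighting is produced by a complementary model and the noise negatives are optimized toward non-uniform regions; this sketch only shows where those two components plug into the contrastive loss.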