Despite profound successes, contrastive representation learning relies on carefully designed data augmentations that use domain-specific knowledge. This challenge is magnified in natural language processing, where no general rules for data augmentation exist due to the discrete nature of natural language. We tackle this challenge by presenting Virtual augmentation Supported Contrastive Learning of sentence representations (VaSCL). Starting from the interpretation that data augmentation essentially constructs the neighborhood of each training instance, we in turn utilize the neighborhood to generate effective data augmentations. Leveraging the large training batch size of contrastive learning, we approximate the neighborhood of an instance via its K-nearest in-batch neighbors in the representation space. We then define an instance discrimination task within this neighborhood and generate the virtual augmentation in an adversarial training manner. We assess the performance of VaSCL on a wide range of downstream tasks and set a new state of the art for unsupervised sentence representation learning.
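As an illustration only (not the authors' released implementation), the neighborhood-approximation step can be sketched in NumPy: each instance's neighborhood is taken to be its K-nearest in-batch neighbors under cosine similarity in the representation space. The embedding values below are hypothetical.

```python
import numpy as np

def knn_neighborhood(embeddings, k):
    """Approximate each instance's neighborhood by its k-nearest
    in-batch neighbors under cosine similarity (self excluded)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T                            # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)           # an instance is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]   # indices of the k nearest neighbors

# toy batch of 4 sentence embeddings (hypothetical values)
batch = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [0.1, 0.9]])
nbrs = knn_neighborhood(batch, k=1)
print(nbrs.ravel().tolist())  # -> [1, 0, 3, 2]
```

In VaSCL, an instance discrimination objective is then defined over this neighborhood, and the virtual augmentation is obtained adversarially; the sketch above covers only the in-batch neighbor retrieval that makes this possible at large batch sizes.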