Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity

Semantically meaningful sentence embeddings are important for numerous tasks in natural language processing. To obtain such embeddings, recent studies have explored using synthetic data generated by pretrained language models (PLMs) as a training corpus. However, PLMs often generate sentences that differ considerably from those written by humans. We hypothesize that treating all of these synthetic examples equally when training deep neural networks can have an adverse effect on learning semantically meaningful embeddings. To analyze this, we first train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences it identifies as machine-written differ significantly from those of human-written sentences. Based on this observation, we propose a novel approach that first trains the classifier to measure the importance of each sentence. The information distilled from the classifier is then used to train a reliable sentence embedding model. Through extensive evaluation on four real-world datasets, we demonstrate that our model trained on synthetic data generalizes well and outperforms the existing baselines. Our implementation is publicly available at https://github.com/ddehun/coling2022_reweighting_sts.
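
The reweighting idea described above can be illustrated with a minimal sketch. The abstract does not state how classifier outputs are turned into importance scores, so everything here is an assumption made for illustration: the rule `w = 1 - p_machine`, the normalization to a mean of one, the squared-error objective, and the function names `importance_weights` and `weighted_mse` are hypothetical and are not taken from the released implementation.

```python
from typing import List, Sequence


def importance_weights(p_machine: Sequence[float]) -> List[float]:
    """Turn classifier scores into per-example weights.

    p_machine[i] is the (assumed) probability that synthetic example i
    looks machine-written. The rule w = 1 - p is an illustrative assumption.
    """
    raw = [1.0 - p for p in p_machine]
    mean = sum(raw) / len(raw)
    if mean == 0.0:
        # Every example is scored as fully machine-written:
        # fall back to uniform weights instead of dividing by zero.
        return [1.0] * len(raw)
    # Normalize so the weights average to one, which keeps the loss
    # on the same scale as the unweighted version.
    return [w / mean for w in raw]


def weighted_mse(pred: Sequence[float],
                 gold: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Squared error between predicted and labeled similarity,
    with each example scaled by its importance weight."""
    if not (len(pred) == len(gold) == len(weights)):
        raise ValueError("pred, gold and weights must have the same length")
    total = sum(w * (p - g) ** 2 for p, g, w in zip(pred, gold, weights))
    return total / len(pred)


if __name__ == "__main__":
    # Hypothetical scores for three synthetic sentence pairs.
    scores = [0.9, 0.2, 0.5]
    weights = importance_weights(scores)
    loss = weighted_mse([0.8, 0.1, 0.6], [1.0, 0.0, 0.5], weights)
    print(weights, loss)
```

Under these assumptions, an example that the classifier confidently flags as machine-written contributes little to the loss, while human-like examples dominate the gradient signal. In an actual training setup the same weights would scale the per-example loss of the sentence embedding model.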
