Learning sentence embeddings often requires a large amount of labeled data. However, for most tasks and domains, labeled data is seldom available and creating it is expensive. In this work, we present a new state-of-the-art unsupervised method based on pre-trained Transformers and a Sequential Denoising Auto-Encoder (TSDAE), which outperforms previous approaches by up to 6.4 points. It can achieve up to 93.1% of the performance of in-domain supervised approaches. Further, we show that TSDAE is a strong domain adaptation and pre-training method for sentence embeddings, significantly outperforming other approaches like Masked Language Modeling (MLM). A crucial shortcoming of previous studies is their narrow evaluation: most work mainly evaluates on the single task of Semantic Textual Similarity (STS), which does not require any domain knowledge. It is unclear whether these proposed methods generalize to other domains and tasks. We fill this gap and evaluate TSDAE and other recent approaches on four different datasets from heterogeneous domains.
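To make the training objective concrete: TSDAE corrupts an input sentence (e.g., by deleting tokens), encodes the corrupted sentence with a pre-trained Transformer into a single fixed-size vector, and trains a decoder to reconstruct the original sentence from that vector alone. Below is a minimal sketch using the sentence-transformers library, which ships a reference implementation of this objective as `DenoisingAutoEncoderLoss`; the placeholder corpus and hyperparameters here are illustrative assumptions, not the paper's exact settings.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

model_name = "bert-base-uncased"

# Encoder: a pre-trained Transformer with CLS pooling, so that the whole
# corrupted sentence must be compressed into one fixed-size embedding.
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), "cls"
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Unlabeled in-domain sentences (placeholder corpus for illustration).
train_sentences = [
    "TSDAE trains sentence embeddings without any labels.",
    "The encoder compresses a corrupted sentence into one vector.",
    "A decoder then reconstructs the original sentence from that vector.",
]

# The dataset applies the denoising corruption (token deletion) on the fly,
# yielding (corrupted sentence, original sentence) training pairs.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The loss attaches a decoder on top of the sentence embedding and scores
# reconstruction of the original sentence; tying encoder and decoder
# weights reduces the number of trained parameters.
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
)
```

After training, only the encoder is kept: `model.encode(...)` produces the sentence embeddings, and the decoder is discarded.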