Producing sentence embeddings in an unsupervised manner is valuable for natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of unsupervised sentence embeddings derived from pretrained models. We study four pretrained models and conduct large-scale experiments on seven datasets regarding sentence semantics. We report three main findings. First, averaging all token representations is better than using only the [CLS] vector. Second, combining both the top and bottom layers is better than using only the top layers. Lastly, a simple whitening-based vector normalization strategy of fewer than 10 lines of code consistently boosts performance.
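As a rough illustration of the third finding, the following numpy sketch implements one common form of whitening: shift the embeddings to zero mean, then apply a linear transform derived from the SVD of their covariance so that the result has (approximately) identity covariance. The function name `whiten` and the `eps` guard against near-zero singular values are our own additions; the paper's exact implementation may differ.

```python
import numpy as np

def whiten(embeddings, eps=1e-9):
    """Whiten sentence embeddings to zero mean and identity covariance.

    `embeddings` is an (n_sentences, dim) array, e.g. sentence vectors
    obtained by averaging all token representations from the top and
    bottom layers of a pretrained model.
    """
    mu = embeddings.mean(axis=0, keepdims=True)   # per-dimension mean
    cov = np.cov((embeddings - mu).T)             # (dim, dim) covariance matrix
    u, s, _ = np.linalg.svd(cov)                  # SVD of the symmetric covariance
    W = u @ np.diag(1.0 / np.sqrt(s + eps))       # whitening transformation
    return (embeddings - mu) @ W                  # whitened embeddings
```

Because the covariance matrix is symmetric positive semi-definite, its SVD coincides with its eigendecomposition, and scaling each principal direction by the inverse square root of its singular value is what equalizes the variance across dimensions.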