Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective, and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaged over 7 tasks, our best unsupervised model achieves relative improvements of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain relative improvements of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on the MSMARCO, Natural Questions, and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.
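To make the training objective concrete, the sketch below shows one common way to implement contrastive pre-training over paired inputs: an InfoNCE-style loss with in-batch negatives, cosine similarity, and a temperature. This is a minimal illustration under those assumptions, not the exact recipe used here; the function name, temperature value, and symmetric two-direction formulation are illustrative choices.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x_emb: torch.Tensor, y_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric in-batch contrastive (InfoNCE-style) loss.

    x_emb, y_emb: [batch, dim] embeddings of paired inputs; (x_i, y_i) is a
    positive pair and every other in-batch pairing serves as a negative.
    """
    # L2-normalize so dot products equal cosine similarities.
    x = F.normalize(x_emb, dim=-1)
    y = F.normalize(y_emb, dim=-1)
    # [batch, batch] similarity matrix, scaled by the temperature.
    logits = x @ y.t() / temperature
    labels = torch.arange(x.size(0), device=x.device)
    # Cross-entropy in both directions (x -> y and y -> x), then average.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

In this formulation, larger batches supply more in-batch negatives per positive pair, which is one reason training at scale can improve the resulting embeddings for both classification and retrieval-style evaluations.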