Text embeddings are commonly evaluated on a small set of datasets from a single task, which fails to cover their possible applications to other tasks. It is unclear whether state-of-the-art embeddings for semantic textual similarity (STS) can be applied equally well to other tasks such as clustering or reranking. This makes progress in the field difficult to track, as new models are constantly proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 56 datasets and 112 languages. By benchmarking 33 models on MTEB, we establish the most comprehensive evaluation of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://huggingface.co/spaces/mteb/leaderboard.
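The open-source evaluation harness is distributed as the `mteb` Python package. Below is a minimal sketch of running a single evaluation, assuming the `MTEB` class and `run` method as documented in the project README; the specific model checkpoint and task name are illustrative choices, not prescribed by the paper:

```python
# Minimal sketch: evaluate one embedding model on one MTEB task.
# Assumes the mteb package's MTEB class per the repo README;
# the chosen model and task are illustrative.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode(sentences) -> embeddings method works;
# this SentenceTransformer checkpoint is just one example.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Restrict the benchmark to a single classification task for brevity.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results")
```

Running the full benchmark is the same call with no task restriction; results are written as JSON files under the given output folder, one per task.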