Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance on 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity (STS) across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, though with substantial variation: models reach high performance on some datasets while notably struggling on low-resource languages. Our human annotations also reveal multiple dataset issues. We additionally benchmark nine large language models (LLMs) as annotators on the reranking, classification, and STS tasks, finding that they fall short of human performance (76.1% vs. 81.2%) despite offering scalability advantages. We provide human performance baselines, insights into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of results and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
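Since the released code lives in the MTEB repository, a minimal sketch of how a model-side evaluation on one of the mentioned task categories can be run with the public mteb library is shown below; the chosen model and task are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of an MTEB evaluation run.
# The model name and task choice here are illustrative assumptions,
# not the configuration used in the paper.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])  # one classification task
results = evaluation.run(model, output_folder="results")
print(results)
```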