We introduce MRMR, the first expert-level, multidisciplinary, multimodal retrieval benchmark requiring intensive reasoning. MRMR contains 1,502 queries spanning 23 domains, with positive documents carefully verified by human experts. Compared to prior benchmarks, MRMR introduces three key advancements. First, it challenges retrieval systems across diverse areas of expertise, enabling fine-grained model comparison across domains. Second, queries are reasoning-intensive, with images requiring deeper interpretation, such as diagnosing microscopy slides. We further introduce Contradiction Retrieval, a novel task that requires models to identify conflicting concepts. Finally, queries and documents are constructed as image-text interleaved sequences. Unlike earlier benchmarks restricted to single images or unimodal documents, MRMR offers a realistic setting with multi-image queries and mixed-modality corpus documents. We conduct an extensive evaluation of four categories of multimodal retrieval systems and 14 frontier models on MRMR. The text embedding model Qwen3-Embedding, paired with LLM-generated image captions, achieves the highest performance, highlighting substantial room for improvement in multimodal retrieval models. Although the latest multimodal models, such as Ops-MM-Embedding, perform competitively on expert-domain queries, they fall short on reasoning-intensive tasks. We believe that MRMR paves the way for advancing multimodal retrieval in more realistic and challenging scenarios.
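To make the strongest baseline concrete, below is a minimal sketch of the caption-then-embed pipeline named above: each image in an interleaved sequence is replaced with an LLM-generated caption, and the resulting plain text is ranked with a text embedding model. The `caption_image` stub, the example data, and the specific Qwen3-Embedding checkpoint are illustrative assumptions, not the paper's exact configuration.

```python
# Caption-then-embed retrieval sketch (assumes sentence-transformers is installed).
from sentence_transformers import SentenceTransformer


def caption_image(image_path: str) -> str:
    # Placeholder for an LLM/VLM captioner; a real pipeline would call a
    # multimodal model here. Purely illustrative.
    return f"[caption of {image_path}]"


def flatten(sequence: list[dict]) -> str:
    """Linearize an image-text interleaved sequence into plain text,
    swapping each image for its generated caption."""
    parts = []
    for item in sequence:
        if item["type"] == "text":
            parts.append(item["text"])
        else:  # item["type"] == "image"
            parts.append(caption_image(item["path"]))
    return " ".join(parts)


# Qwen3-Embedding is published in several sizes; the 0.6B checkpoint is shown
# here as an assumption, any size would slot in the same way.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Toy multi-image query and mixed-modality corpus in the interleaved format.
query = [
    {"type": "text", "text": "What pathology does this slide show?"},
    {"type": "image", "path": "slide_17.png"},
]
corpus = [
    [{"type": "text", "text": "Overview of renal biopsy findings."}],
    [{"type": "image", "path": "fig2.png"},
     {"type": "text", "text": "Discussion of glomerular lesions."}],
]

q_emb = model.encode([flatten(query)], normalize_embeddings=True)
d_emb = model.encode([flatten(d) for d in corpus], normalize_embeddings=True)
scores = (q_emb @ d_emb.T)[0]      # cosine similarity of query vs. each document
ranking = scores.argsort()[::-1]   # best-matching documents first
print(ranking, scores[ranking])
```

Linearizing images into captions lets an off-the-shelf text embedder handle multi-image queries and mixed-modality documents, which is why this simple pipeline can outperform native multimodal embedders on reasoning-heavy queries.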