We conduct a moderate-scale, largely contamination-free evaluation of current large reasoning models (LRMs) and report some preliminary findings. We also release ROME, our evaluation benchmark for vision-language models, intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and further updates are available at: https://flageval-baai.github.io/LRM-Eval/