We conduct a moderate-scale, largely contamination-free evaluation of current large reasoning models (LRMs) and report some preliminary findings. We also release ROME, our evaluation benchmark for vision-language models, intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and further updates are available at: https://flageval-baai.github.io/LRM-Eval/