Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three state-of-the-art multimodal LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) on three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot prompting), and (iii) reasoning strategies (Standalone, chain-of-thought (CoT), LogicLM). For the latter, we adapt LogicLM, a framework that combines LLMs with symbolic solvers for structured reasoning, to the music domain. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show substantial accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini 2.5 Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet "listen" reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.
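To make the symbolic side of this setup concrete, the minimal sketch below illustrates how one of the benchmarked skills, Chord Quality Identification, reduces to a rule-based check over MIDI pitch classes; this is the kind of structured check a LogicLM-style symbolic solver can perform directly over symbols, whereas audio input first demands reliable perception. The sketch is an illustration under our own assumptions, not the paper's implementation; all function and variable names are hypothetical.

```python
# Illustrative sketch (not the paper's code): triad quality identification
# from symbolic (MIDI) pitches via interval-pattern lookup.

# Interval patterns (semitones above the root) for common triad qualities.
TRIAD_QUALITIES = {
    (0, 4, 7): "major",
    (0, 3, 7): "minor",
    (0, 3, 6): "diminished",
    (0, 4, 8): "augmented",
}

def chord_quality(midi_pitches):
    """Return the triad quality implied by a collection of MIDI pitch numbers.

    Reduces the notes to pitch classes, tries each as a candidate root, and
    looks the resulting interval pattern up in TRIAD_QUALITIES. Returns
    "unknown" if no rotation matches a known triad.
    """
    pitch_classes = sorted({p % 12 for p in midi_pitches})
    for root in pitch_classes:
        intervals = tuple(sorted((pc - root) % 12 for pc in pitch_classes))
        quality = TRIAD_QUALITIES.get(intervals)
        if quality is not None:
            return quality
    return "unknown"

if __name__ == "__main__":
    print(chord_quality([60, 64, 67]))  # C, E, G  -> "major"
    print(chord_quality([57, 60, 64]))  # A, C, E  -> "minor"
    print(chord_quality([59, 62, 65]))  # B, D, F  -> "diminished"
```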