Music language models (Music LMs), like vision language models, leverage multimodal representations to answer natural language queries about musical audio recordings. Although Music LMs are reportedly improving, we find that current evaluations fail to capture whether their answers are correct. Specifically, for all Music LMs that we examine, widely used evaluation metrics such as BLEU, METEOR, and BERTScore measure nothing beyond the linguistic fluency of the model's responses. To measure the true performance of Music LMs, we propose (1) a better general-purpose evaluation metric, adapted to the music domain, and (2) a factual evaluation framework that quantifies the correctness of a Music LM's responses. Our framework is agnostic to the modality of the question-answering model and could be generalized to quantify performance in other open-ended question-answering domains. We use open datasets in our experiments and will release all code upon publication.