With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities on tasks that fuse the vision and language modalities. In this work, we benchmark the extent to which VLMs can act as highly trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing over 4,000 English words spoken in isolation, each paired with stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task in which a model must predict the correct phonemic or graphemic transcription of a spoken word from among three distractor transcriptions selected by their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, indicating that interpreting such figures requires specific parametric knowledge that paired samples alone do not provide.
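To make the distractor-selection step concrete, the sketch below shows one plausible implementation, assuming a pronunciation lexicon that maps each word to a phoneme sequence (e.g. CMUdict-style ARPAbet symbols) and selecting the phonemically closest words as distractors. The function and variable names (`phoneme_edit_distance`, `select_distractors`, `lexicon`) are illustrative assumptions, not taken from the paper, and the paper's exact sampling policy over edit distances may differ.

```python
# Sketch of distractor selection by phonemic edit distance.
# Assumes a lexicon mapping word -> phoneme sequence; names are illustrative only.
from typing import Dict, List, Tuple


def phoneme_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance over phoneme symbols (insertions, deletions, substitutions)."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (pa != pb))    # substitution
            prev = cur
    return dp[-1]


def select_distractors(target: str,
                       lexicon: Dict[str, List[str]],
                       k: int = 3) -> List[str]:
    """Pick the k words whose pronunciations are phonemically closest to the target."""
    target_phones = lexicon[target]
    ranked: List[Tuple[int, str]] = sorted(
        (phoneme_edit_distance(target_phones, phones), word)
        for word, phones in lexicon.items() if word != target
    )
    return [word for _, word in ranked[:k]]


# Toy example with a tiny ARPAbet-style lexicon; a real setup would use a full dictionary.
lexicon = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "bat": ["B", "AE", "T"],
    "cut": ["K", "AH", "T"],
    "dog": ["D", "AO", "G"],
}
print(select_distractors("cat", lexicon))  # ['bat', 'cap', 'cut']
```

Selecting near-neighbour pronunciations in this way forces the model to discriminate fine phonetic detail in the figure rather than exploit coarse cues, which is consistent with the hard multiple-choice setup the abstract describes.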