With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs can act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task in which models must identify the correct phonemic or graphemic transcription of a spoken word from among three distractor transcriptions selected by their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, suggesting that interpreting such figures requires specific parametric knowledge, which paired samples alone do not provide.
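The distractor-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes transcriptions are represented as lists of phoneme symbols (e.g. ARPAbet), and the function names and the choice of plain Levenshtein distance over phoneme sequences are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def pick_distractors(target, lexicon, k=3):
    """Pick the k transcriptions in the lexicon closest to (but not
    identical with) the target, by phonemic edit distance."""
    candidates = [(edit_distance(target, t), t) for t in lexicon if t != target]
    candidates.sort(key=lambda pair: pair[0])
    return [t for _, t in candidates[:k]]
```

For example, for the target /k ae t/ ("cat"), near neighbours such as /b ae t/ ("bat") would be ranked ahead of distant words like /d ao g/ ("dog"), yielding distractors that are confusable with the ground truth rather than trivially distinguishable.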
