Sound symbolism is a linguistic concept referring to non-arbitrary associations between phonetic forms and their meanings. We suggest that it offers a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory input forms over up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions, which align with existing linguistic research across multiple semantic dimensions, and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity from the perspective of MLLM interpretability.