Sound symbolism, the non-arbitrary mapping between word sounds and meanings, has long been demonstrated through anecdotal experiments such as bouba/kiki, but has rarely been tested at scale. We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We compile a typologically broad dataset of 810 adjectives (27 languages, 30 words each), each phonemically transcribed and validated against native-speaker audio. Using interpretable classifiers over bag-of-segment features, we find that phonological form predicts size semantics above chance even across unrelated languages, with both vowels and consonants contributing. To probe universality beyond genealogy, we train an adversarial scrubber that suppresses language identity (and, in a separate setting, language-family identity) while preserving the size signal. After scrubbing, language prediction, averaged across languages and settings, falls below chance while size prediction remains significantly above chance, indicating a cross-family sound-symbolic bias. We release data, code, and diagnostic tools for future large-scale studies of iconicity.
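As an illustration of what "bag-of-segment features" could look like in practice, the following minimal sketch counts phonemic segments per word and splits the data by held-out language. It is not the released code: the space-separated transcription format, the function names, the row layout, and the leave-one-language-out split are all assumptions made for this example, and the four toy words are common English and French adjectives chosen for illustration, not entries taken from the dataset.

```python
from collections import Counter


def bag_of_segments(transcription: str) -> Counter:
    """Count each segment in a space-separated phonemic transcription.

    Assumption: segments are pre-tokenised and separated by spaces,
    so a diphthong such as "aɪ" is one segment.
    """
    return Counter(transcription.split())


def to_vector(bag: Counter, inventory: list[str]) -> list[int]:
    """Project a bag of segments onto a fixed, ordered segment inventory."""
    return [bag.get(segment, 0) for segment in inventory]


def leave_one_language_out(rows: list[dict], held_out: str):
    """Split rows so that the held-out language never appears in training."""
    train = [r for r in rows if r["lang"] != held_out]
    test = [r for r in rows if r["lang"] == held_out]
    return train, test


# Toy rows for illustration only (hypothetical layout, not the released data).
rows = [
    {"lang": "eng", "ipa": "b ɪ ɡ", "size": "large"},
    {"lang": "eng", "ipa": "t aɪ n i", "size": "small"},
    {"lang": "fra", "ipa": "ɡ ʁ ɑ̃", "size": "large"},
    {"lang": "fra", "ipa": "p ə t i", "size": "small"},
]

# Shared inventory: every segment observed in any language, in sorted order.
inventory = sorted({seg for r in rows for seg in r["ipa"].split()})

# One count vector per word; each dimension is a single, nameable segment,
# which is what keeps a linear classifier over these features interpretable.
features = [to_vector(bag_of_segments(r["ipa"]), inventory) for r in rows]

# Train on English, evaluate on French (cross-language generalisation).
train, test = leave_one_language_out(rows, "fra")
```

Because every feature dimension corresponds to one segment, the weights of any linear classifier trained on such vectors can be read directly as per-segment associations with "small" or "large", which is one way the vowel and consonant contributions mentioned above could be inspected.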