Self-supervised models for speech processing form representational spaces without using any external labels. Increasingly, they appear to be a feasible way of at least partially eliminating the need for costly manual annotations, a problem of particular concern for low-resource languages. But what kind of representational spaces do these models construct? Human perception specializes to the sounds of listeners' native languages. Does the same thing happen in self-supervised models? We examine the representational spaces of three kinds of state-of-the-art self-supervised models: wav2vec 2.0, HuBERT and contrastive predictive coding (CPC), and compare them with the perceptual spaces of French-speaking and English-speaking human listeners, both globally and taking into account the behavioural differences between the two language groups. We find that the CPC model shows a small native language effect, but that wav2vec 2.0 and HuBERT seem to develop a universal speech perception space which is not language specific. A comparison against the predictions of supervised phone recognisers suggests that all three self-supervised models capture relatively fine-grained perceptual phenomena, while supervised models are better at capturing coarser, phone-level effects of listeners' native language on perception.