Unsupervised speech representations have taken off, with benchmarks (SUPERB, ZeroSpeech) demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of ``discovering the phonemes'' of a language or a similar low-bitrate encoding. However, one of the critical properties of phoneme transcriptions is context-invariance: the phonetic context of a speech sound can have massive influence on the way it is pronounced, while the text remains stable. This is what allows tokens of the same word to have the same transcriptions -- key to language understanding. Current benchmarks do not measure context-invariance. We develop a new version of the ZeroSpeech ABX benchmark that measures context-invariance, and apply it to recent self-supervised representations. We demonstrate that the context-independence of representations is predictive of the stability of word-level representations. We suggest research concentrate on improving context-independence of self-supervised and unsupervised representations.
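The ABX phoneme-discriminability test underlying the benchmark can be sketched as follows. Given two tokens A and X of the same phoneme category and a token B of a different category, the representation passes a trial when X is closer to A than to B; the score is the fraction of correct trials. The function names and the frame-averaging pooling below are illustrative assumptions (the actual ZeroSpeech evaluation aligns frame sequences with DTW and averages distances along the alignment path):

```python
import numpy as np

def abx_correct(a, b, x):
    # a, x: representations of two tokens of the same phoneme category;
    # b: a token of a different category. Each is a (frames, dims) array.
    # Mean-pool over frames as a simple stand-in for the DTW alignment
    # used in the real ZeroSpeech ABX evaluation.
    va, vb, vx = (m.mean(axis=0) for m in (a, b, x))
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # A trial is correct when X is more similar to A than to B.
    return cos(va, vx) > cos(vb, vx)

def abx_score(triples):
    # triples: iterable of (A, B, X) arrays; returns accuracy in [0, 1].
    results = [abx_correct(a, b, x) for a, b, x in triples]
    return sum(results) / len(results)
```

A context-invariance variant of the test would draw A and X from tokens of the same phoneme in *different* phonetic contexts, so that a representation only scores well if it abstracts away from coarticulation.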