Recent self-supervised learning (SSL) models have been shown to learn rich representations of speech that can readily be utilized by diverse downstream tasks. To understand such utility, various analyses have been conducted on speech SSL models to reveal what information is encoded in the learned representations and how. Although the scope of previous analyses extends across acoustic, phonetic, and semantic perspectives, the physical grounding of speech production has not yet received full attention. To bridge this gap, we conduct a comprehensive analysis linking speech representations to articulatory trajectories measured by electromagnetic articulography (EMA). Our analysis is based on a linear probing approach in which we measure an articulatory score as the average correlation of a linear mapping to EMA. We analyze a set of SSL models selected from the leaderboard of the SUPERB benchmark and perform further layer-wise analyses on the two most successful models, Wav2Vec 2.0 and HuBERT. Surprisingly, representations from recent speech SSL models are highly correlated with EMA traces (best: r = 0.81), and only 5 minutes of data are sufficient to train a high-performing linear model (r = 0.77). Our findings suggest that SSL models learn representations closely aligned with continuous articulations, providing a novel insight into speech SSL.
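The sketch below illustrates the kind of linear probing the abstract describes: fit a linear map from frame-level SSL features to EMA channels and report the correlation averaged over channels. It is a minimal illustration under assumed data shapes, not the authors' exact pipeline; the function name, array shapes, and the choice of ridge regularization are assumptions for this example.

```python
# Minimal sketch of a linear articulatory probe (illustrative assumptions,
# not the paper's exact implementation).
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

def articulatory_score(train_feats, train_ema, test_feats, test_ema):
    """Average Pearson correlation of a linear probe from SSL features to EMA.

    train_feats / test_feats: (num_frames, feat_dim) SSL representations.
    train_ema   / test_ema:   (num_frames, num_channels) EMA trajectories,
                              time-aligned to the feature frames.
    """
    probe = Ridge(alpha=1.0)          # simple linear map; regularization strength is a choice
    probe.fit(train_feats, train_ema)
    pred = probe.predict(test_feats)  # (num_frames, num_channels)

    # Correlate each predicted articulator channel with the measured one,
    # then average across channels to obtain a single articulatory score.
    corrs = [pearsonr(pred[:, c], test_ema[:, c])[0]
             for c in range(test_ema.shape[1])]
    return float(np.mean(corrs))

# Illustrative usage with random stand-in data; real use would pass
# layer-wise features extracted from, e.g., Wav2Vec 2.0 or HuBERT.
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(2000, 768)), rng.normal(size=(500, 768))
Y_tr, Y_te = rng.normal(size=(2000, 12)), rng.normal(size=(500, 12))
print(articulatory_score(X_tr, Y_tr, X_te, Y_te))
```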