Numerous self-supervised learning (SSL) models have been proposed for pre-training speech representations, and recent SSL models are highly successful across diverse downstream tasks. To understand these capabilities, prior work probes the representations of speech models to reveal what speech-related information is encoded in the learned representations and how. While encoding properties have been extensively explored from the perspectives of acoustics, phonetics, and semantics, their physical grounding in speech production has not yet received full attention. To bridge this gap, we conduct a comprehensive analysis linking speech representations to articulatory trajectories measured by electromagnetic articulography (EMA). Our analysis is based on a linear probing approach, where we measure an articulatory score defined as the average correlation of a linear mapping to EMA. We analyze a set of SSL models selected from the leaderboard of the SUPERB benchmark and perform further detailed analyses on two major models, Wav2Vec 2.0 and HuBERT. Surprisingly, representations from recent speech SSL models are highly correlated with EMA traces (best: r = 0.81), and only 5 minutes of data were sufficient to train a linear model with high performance (r = 0.77). Our findings suggest that SSL models learn representations closely aligned with continuous articulations and provide a novel insight into speech SSL.
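The linear probing procedure described above (fitting a linear map from frame-level SSL features to EMA channels and averaging per-channel correlations) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name `articulatory_score` and the closed-form least-squares fit are assumptions for exposition.

```python
import numpy as np

def articulatory_score(features, ema, train_frac=0.8):
    """Hypothetical sketch of the linear probe: fit a least-squares
    linear map from SSL features (T x D) to EMA trajectories (T x C),
    then report the mean per-channel Pearson r on held-out frames."""
    n_train = int(len(features) * train_frac)
    X_tr, X_te = features[:n_train], features[n_train:]
    Y_tr, Y_te = ema[:n_train], ema[n_train:]
    # Append a bias column and solve the least-squares problem in closed form.
    X_tr_b = np.hstack([X_tr, np.ones((len(X_tr), 1))])
    X_te_b = np.hstack([X_te, np.ones((len(X_te), 1))])
    W, *_ = np.linalg.lstsq(X_tr_b, Y_tr, rcond=None)
    Y_hat = X_te_b @ W
    # Average Pearson correlation across EMA channels.
    rs = [np.corrcoef(Y_hat[:, c], Y_te[:, c])[0, 1]
          for c in range(ema.shape[1])]
    return float(np.mean(rs))
```

On features that are (approximately) linearly related to the EMA targets, the score approaches 1; the abstract's r = 0.81 is the analogous correlation measured on real SSL representations and EMA recordings.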