Self-Supervised Learning (SSL) from speech data has produced models that achieve remarkable performance on many tasks and are known to implicitly encode many aspects of the information latently present in speech signals. However, relatively little is known about the suitability of such models for prosody-related tasks or the extent to which they encode prosodic information. We present a new evaluation framework, SUPERB-prosody, consisting of three prosody-related downstream tasks and two pseudo tasks. We find that 13 of the 15 SSL models evaluated outperform the baseline on all prosody-related tasks. We also show good performance on the two pseudo tasks: prosody reconstruction and future prosody prediction. We further analyze the layerwise contributions of the SSL models. Overall, we conclude that SSL speech models are highly effective for prosody-related tasks.
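To make the probing setup concrete, the sketch below illustrates a SUPERB-style layerwise analysis under stated assumptions: a frozen SSL encoder's per-layer features are combined with a learnable softmax-weighted sum and fed to a lightweight head that regresses frame-level F0, standing in for a prosody reconstruction task. The class name, dimensions, and random stand-in features are hypothetical illustrations, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LayerwiseProbe(nn.Module):
    """Hypothetical SUPERB-style probe: learnable weighted sum over frozen
    SSL layer outputs followed by a linear head for frame-level regression."""

    def __init__(self, num_layers: int, hidden_dim: int, out_dim: int = 1):
        super().__init__()
        # One scalar weight per SSL layer; softmax makes them sum to 1.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, frames, hidden_dim), assumed frozen.
        w = torch.softmax(self.layer_weights, dim=0)
        pooled = (w[:, None, None, None] * hidden_states).sum(dim=0)
        return self.head(pooled).squeeze(-1)  # (batch, frames)

if __name__ == "__main__":
    # Random stand-in for frozen SSL features: 13 layers, batch 2, 100 frames, dim 768.
    feats = torch.randn(13, 2, 100, 768)
    target_f0 = torch.rand(2, 100) * 300.0  # hypothetical per-frame F0 targets (Hz)

    probe = LayerwiseProbe(num_layers=13, hidden_dim=768)
    loss = nn.functional.l1_loss(probe(feats), target_f0)
    loss.backward()

    # After training, the softmax-normalized weights indicate each layer's
    # contribution to the prosody task.
    print(loss.item(), torch.softmax(probe.layer_weights, dim=0))
```

The same probe structure can be reused for the other tasks by swapping the target (e.g., predicting prosodic features of future frames for the future prosody prediction pseudo task); only the head and loss change, while the frozen SSL features and layer weights remain the objects of analysis.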