Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analysis of context-window size demonstrates that, surprisingly, 2-second context windows achieve 96\% of the performance of Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted from intermediate layers of the network, stable performance across several layers allows a single universal representation to reach near-optimal performance on all tasks.
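The linear-probe evaluation described above can be sketched roughly as follows: frame-level embeddings from a pretrained encoder are averaged over time into one fixed-size vector per utterance, and a simple linear classifier is fit on top. Everything here is a placeholder, not the paper's actual model or data: `embed_utterance` stands in for the pretrained Conformer, and the embedding dimension, utterance count, and binary labels are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained encoder: returns frame-level
# embeddings of shape (num_frames, dim). A real pipeline would run the
# Conformer over the audio instead.
def embed_utterance(num_frames, dim=64):
    return rng.normal(size=(num_frames, dim))

# Time-average the frame embeddings into one fixed-size vector per
# utterance, regardless of utterance length.
X = np.stack([
    embed_utterance(int(rng.integers(50, 200))).mean(axis=0)
    for _ in range(40)
])
y = rng.integers(0, 2, size=40)  # placeholder paralinguistic labels

# Linear probe: a plain logistic-regression classifier on top of the
# frozen, time-averaged representation.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```

With random placeholder embeddings the accuracy is meaningless; the point is the shape of the pipeline, where only the final linear layer is trained.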