Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, in particular the first- and second-order statistics of the representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised trained models, based on the correlations between the coefficients of the representations, which we refer to as correlation pooling. We show improvements over mean pooling and further gains when the two pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.
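To make the distinction between the two pooling strategies concrete, the sketch below contrasts mean pooling (first-order statistics) with correlation pooling over a sequence of frame-level representations. It is a minimal PyTorch illustration, not the released implementation: the function names `mean_pooling` and `correlation_pooling`, the choice to keep only the upper-triangular correlation entries, and the example dimensions are assumptions for illustration.

```python
import torch


def mean_pooling(reps: torch.Tensor) -> torch.Tensor:
    """First-order statistics: average frame-level representations over time.

    reps: (T, D) tensor of T frames of D-dimensional SSL representations.
    Returns a (D,) utterance-level embedding.
    """
    return reps.mean(dim=0)


def correlation_pooling(reps: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pool the pairwise correlations between representation coefficients.

    reps: (T, D) tensor. Returns the upper triangle (excluding the diagonal)
    of the D x D Pearson correlation matrix, flattened to a vector of
    length D * (D - 1) / 2.
    """
    centered = reps - reps.mean(dim=0, keepdim=True)      # remove per-coefficient mean
    cov = centered.T @ centered / (reps.shape[0] - 1)      # (D, D) sample covariance
    std = torch.sqrt(torch.diag(cov)).clamp_min(eps)       # per-coefficient std. dev.
    corr = cov / (std[:, None] * std[None, :])             # normalise to correlations
    iu = torch.triu_indices(corr.shape[0], corr.shape[1], offset=1)
    return corr[iu[0], iu[1]]                              # unique off-diagonal entries


# Example: pool a 200-frame sequence of 768-dimensional representations
# (hypothetical sizes, roughly matching common SSL model outputs).
reps = torch.randn(200, 768)
utt_mean = mean_pooling(reps)         # shape: (768,)
utt_corr = correlation_pooling(reps)  # shape: (768 * 767 / 2,)
```

In a fusion setup, the two utterance-level vectors could simply be concatenated (or their downstream scores combined) before the classifier; the exact fusion scheme used in the paper is not specified here.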