Speaker profiling, which aims to estimate speaker characteristics such as age and height, has a wide range of applications in forensics, recommendation systems, etc. In this work, we propose a semi-supervised learning approach to mitigate the scarcity of training data for speaker profiling. This is done by using an external corpus with speaker information to learn a better representation, which in turn improves the speaker profiling system. Specifically, besides the standard supervised learning path, the proposed framework has two more paths: (1) an unsupervised speaker representation learning path that helps to capture speaker information; (2) a consistency training path that improves the robustness of the system by encouraging it to produce similar predictions for utterances of the same speaker. The proposed approach is evaluated on the TIMIT and NISP datasets for age, height, and gender estimation, while LibriSpeech is used as the unsupervised external corpus. Trained in both single-task and multi-task settings, our approach achieves state-of-the-art results for age estimation on the TIMIT test set, with a Root Mean Square Error (RMSE) of 6.8 and 7.4 years and a Mean Absolute Error (MAE) of 4.8 and 5.0 years for male and female speakers, respectively.
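To make the reported metrics and the consistency idea concrete, the following is a minimal Python sketch. The RMSE and MAE definitions are standard. Everything else is an assumption made for illustration, since the abstract does not specify it: the squared-difference form of the consistency term, the weighted-sum combination of the three paths, and the names `consistency_penalty`, `total_loss`, `w_rep`, and `w_cons` are all hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error between reference and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean Absolute Error between reference and predicted values."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


def consistency_penalty(pred_a, pred_b):
    """Assumed form of a consistency term: mean squared difference between
    predictions for two utterances of the same speaker (pred_a[i] and
    pred_b[i] belong to speaker i). The paper's actual loss may differ."""
    squared = [(a - b) ** 2 for a, b in zip(pred_a, pred_b)]
    return sum(squared) / len(squared)


def total_loss(supervised, representation, consistency, w_rep=1.0, w_cons=1.0):
    """Hypothetical weighted sum of the three training paths; the weights
    w_rep and w_cons are illustrative, not values from the paper."""
    return supervised + w_rep * representation + w_cons * consistency


if __name__ == "__main__":
    ages_true = [30.0, 40.0]
    ages_pred = [33.0, 36.0]
    print("RMSE:", rmse(ages_true, ages_pred))
    print("MAE:", mae(ages_true, ages_pred))
    # Predictions for two different utterances of the same two speakers.
    print("Consistency:", consistency_penalty([30.0, 41.0], [32.0, 40.0]))
```

Under these assumptions, the consistency term is zero only when both utterances of a speaker receive identical predictions, which is the behaviour the consistency training path is described as encouraging.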