The ICML Expressive Vocalizations (ExVo) Multi-task challenge 2022 focuses on understanding the emotional facets of non-linguistic vocalizations, known as vocal bursts (VBs). The objective of the challenge is to predict emotional intensities for VBs; as a multi-task challenge, it also requires predicting the speaker's age and native country. For this challenge, we study and compare two distinct embedding spaces: self-supervised learning (SSL) based embeddings and task-specific supervised learning based embeddings. To that end, we investigate feature representations obtained from several pre-trained SSL neural networks and from task-specific supervised classification neural networks. Our studies show that the best performance is obtained with a hybrid approach that uses predictions derived from both SSL and task-specific supervised learning. On the test set, our best system surpasses the ComPARE baseline on the overall score $S_{MTL}$ (the harmonic mean of all sub-task scores) by a relative margin of $13\%$.
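As a rough illustration of how the overall score and the reported relative margin can be read, the sketch below computes a harmonic mean over sub-task scores and a relative gain over a baseline. The function names and all numeric values are hypothetical and are not results from this work. It also assumes that every sub-task score is positive and already expressed so that higher is better; any transformation the challenge applies to an individual sub-task metric before averaging is not covered here.

```python
from statistics import harmonic_mean


def s_mtl(sub_task_scores):
    """Harmonic mean of per-task scores (assumed positive, higher is better)."""
    return harmonic_mean(sub_task_scores)


def relative_gain(system_score, baseline_score):
    """Relative improvement of a system over a baseline, as a fraction."""
    return (system_score - baseline_score) / baseline_score


if __name__ == "__main__":
    # Hypothetical sub-task scores (emotion, age, country); not taken from the paper.
    baseline = s_mtl([0.40, 0.30, 0.50])
    system = s_mtl([0.45, 0.35, 0.55])
    print(f"baseline S_MTL = {baseline:.3f}")
    print(f"system   S_MTL = {system:.3f}")
    print(f"relative gain  = {100 * relative_gain(system, baseline):.1f}%")
```

Because the harmonic mean is dominated by the smallest score, a system cannot reach a high $S_{MTL}$ by excelling on one sub-task while neglecting another, which is a common reason for choosing it as a multi-task summary metric.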