Speech Emotion Recognition (SER) is a key affective computing technology that enables emotionally intelligent artificial intelligence. While SER is challenging in general, it is particularly difficult for low-resource languages such as Urdu. This study investigates Urdu SER in a cross-corpus setting, an area that remains largely unexplored. We employ a cross-corpus evaluation framework across three Urdu emotional speech datasets to test model generalization. Two standard, domain-knowledge-based acoustic feature sets, eGeMAPS and ComParE, represent each speech signal as a feature vector, which is then passed to Logistic Regression and Multilayer Perceptron classifiers. Classification performance is assessed using unweighted average recall (UAR), which accounts for class-label imbalance. Results show that self-corpus validation often overestimates performance, with UAR exceeding that of cross-corpus evaluation by up to 13%, underscoring that cross-corpus evaluation offers a more realistic measure of model robustness. Overall, this work emphasizes the importance of cross-corpus validation for Urdu SER, and its findings contribute to advancing affective computing research for underrepresented language communities.
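To illustrate why UAR is preferred over plain accuracy when emotion classes are imbalanced, the following is a minimal sketch in pure Python. The function names and the toy labels are assumptions made for illustration only; they are not taken from the study's code or data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls: each class counts equally, whatever its size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly (dominated by large classes)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, made-up labels: 8 "neutral" and 2 "angry" utterances.
y_true = ["neutral"] * 8 + ["angry"] * 2
# A degenerate model that always predicts the majority class.
y_pred = ["neutral"] * 10

acc = accuracy(y_true, y_pred)                   # 8 of 10 correct
uar = unweighted_average_recall(y_true, y_pred)  # (1.0 + 0.0) / 2
print(f"accuracy={acc:.2f}  UAR={uar:.2f}")
```

In this toy case the majority-class predictor reaches an accuracy of 0.80 but a UAR of only 0.50, because the recall on the minority class is zero. The same quantity can be obtained with scikit-learn's `recall_score(y_true, y_pred, average="macro")`.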