In this paper we present the Amharic Speech Emotion Dataset (ASED), which covers four dialects (Gojjam, Wollo, Shewa and Gonder) and five emotions (neutral, fearful, happy, sad and angry). We believe it is the first Speech Emotion Recognition (SER) dataset for the Amharic language. Sixty-five volunteer participants, all native speakers, recorded 2,474 sound samples, each two to four seconds long. Eight judges assigned emotions to the samples with a high level of agreement (Fleiss' kappa = 0.8). The resulting dataset is freely available for download. Next, we developed a four-layer variant of the well-known VGG model, which we call VGGb. Three SER experiments were then carried out on ASED using VGGb. First, we investigated whether Mel-spectrogram features or Mel-Frequency Cepstral Coefficient (MFCC) features work best for Amharic. This was done by training two VGGb SER models on ASED, one using Mel-spectrograms and the other using MFCCs. Four forms of training were tried: standard cross-validation, and three variants based on sentences, dialects and speaker groups, in which a sentence used for training would not be used for testing, and likewise for a dialect or speaker group. The conclusion was that MFCC features are superior under all four training schemes. MFCC was therefore adopted for Experiment 2, where VGGb was compared on ASED against three existing models: ResNet50, AlexNet and LSTM. VGGb was found to have very good accuracy (90.73%) as well as the fastest training time. In Experiment 3, the performance of VGGb was compared when trained on two existing SER datasets, RAVDESS (English) and EMO-DB (German), as well as on ASED (Amharic). Results were comparable across these languages, with ASED giving the highest accuracy. This suggests that VGGb can be successfully applied to other languages. We hope that ASED will encourage researchers to experiment with other models for Amharic SER.