The aim of this work is to define a speech emotion recognition (SER) model able to recognize positive, neutral, and negative emotions in natural conversations of elderly Italian people. Several SER datasets are available in the literature. However, most of them are in English or Chinese and were recorded while actors and actresses pronounced short phrases, so they do not reflect natural conversation. Moreover, only a small fraction of the recordings across these databases involve elderly speakers. Therefore, in this work, a multi-language and multi-age corpus is built by merging an English dataset, which also includes elderly people, with an Italian dataset. A general model based on XGBoost, trained on young and adult English-speaking actors and actresses, is proposed. Two domain adaptation strategies are then proposed to adapt the model to elderly people and to Italian speakers. The results suggest that this approach improves classification performance, and they also underline the need to collect new datasets.