In this paper, we introduce a pre-trained audio-visual Transformer trained on more than 500k utterances from nearly 4,000 celebrities in the VoxCeleb2 dataset for human behavior understanding. The model aims to capture and extract useful information from the interactions between human facial and auditory behaviors, with applications in emotion recognition. We evaluate the model on two datasets, namely CREMA-D (emotion classification) and MSP-IMPROV (continuous emotion regression). Experimental results show that fine-tuning the pre-trained model improves emotion classification accuracy by 5-7% and the Concordance Correlation Coefficient (CCC) in continuous emotion recognition by 0.03-0.09 compared to the same model trained from scratch. We also demonstrate the robustness of fine-tuning the pre-trained model in a low-resource setting: with only 10% of the original training set, fine-tuning improves emotion recognition accuracy by at least 10% and the CCC score by at least 0.1 in continuous emotion recognition.
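For reference, the CCC used above to score continuous emotion regression measures agreement between predicted and ground-truth labels, penalizing both low correlation and shifts in mean or scale. The following is a minimal NumPy sketch of the standard CCC formula, not the authors' evaluation code:

```python
import numpy as np

def concordance_ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance Correlation Coefficient between labels and predictions.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()
    # Population covariance between ground truth and predictions.
    cov = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    return 2.0 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)
```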