Truly real-life data presents a strong, yet exciting, challenge for sentiment and emotion research. The high variety of possible `in-the-wild' properties makes large datasets such as these indispensable for building robust machine learning models. In this context, no dataset has yet been made available that is of sufficient size and covers a deep enough variety of challenges in each modality to force an exploratory analysis of the interplay of all modalities. In this contribution, we present MuSe-CaR, a first-of-its-kind multimodal dataset. The data is publicly available, having recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge, which focused on the tasks of emotion, emotion-target engagement, and trustworthiness recognition by comprehensively integrating the audio-visual and language modalities. Furthermore, we give a thorough overview of the dataset in terms of collection and annotation, including annotation tiers not used in this year's MuSe 2020. In addition, for one of the sub-challenges - predicting the level of trustworthiness - no participant outperformed the baseline model. We therefore propose a simple but highly efficient Multi-Head-Attention network that, using multimodal fusion, exceeds this baseline by around 0.2 CCC (an improvement of almost 50 %).
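As a rough illustration only (the proposed network is detailed later in the paper), the following PyTorch-style sketch shows how per-modality feature sequences can be fused with a single multi-head attention layer and regressed to a continuous trustworthiness score; the feature dimensions and layer sizes are hypothetical placeholders, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalMHAFusion(nn.Module):
    """Minimal sketch: project each modality to a shared space, fuse the
    concatenated sequences with multi-head self-attention, and regress a
    continuous target (e.g. trustworthiness). Dimensions are illustrative."""

    def __init__(self, audio_dim=88, video_dim=512, text_dim=768,
                 model_dim=128, num_heads=4):
        super().__init__()
        # Per-modality projections into a common embedding space (hypothetical sizes).
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        self.video_proj = nn.Linear(video_dim, model_dim)
        self.text_proj = nn.Linear(text_dim, model_dim)
        self.attention = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.regressor = nn.Sequential(
            nn.Linear(model_dim, model_dim), nn.ReLU(), nn.Linear(model_dim, 1)
        )

    def forward(self, audio, video, text):
        # Each input: (batch, seq_len, feature_dim); sequences assumed time-aligned.
        fused = torch.cat([self.audio_proj(audio),
                           self.video_proj(video),
                           self.text_proj(text)], dim=1)
        attended, _ = self.attention(fused, fused, fused)  # self-attention across all modalities
        pooled = attended.mean(dim=1)                      # simple average pooling over the sequence
        return self.regressor(pooled).squeeze(-1)          # one continuous prediction per sample

# Usage with random tensors standing in for MuSe-CaR segment features.
model = MultimodalMHAFusion()
a, v, t = torch.randn(2, 50, 88), torch.randn(2, 50, 512), torch.randn(2, 50, 768)
print(model(a, v, t).shape)  # torch.Size([2])
```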