Developing machine learning algorithms to understand person-to-person engagement can enable natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person's audio-visual expression, which includes tone of voice and facial expression, serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of a user's expression. We first implement an audio-visual baseline model with recurrent layers that achieves results competitive with the current state of the art. Next, we propose a transformer architecture with encoder layers that better integrates audio-visual features for expression tracking. Performance on the Aff-Wild2 database shows that the proposed methods outperform the baseline architecture with recurrent layers, with absolute gains of approximately 2% for the arousal and valence descriptors. Further, multimodal architectures yield significant improvements over models trained on single modalities, with gains of up to 3.6%. Ablation studies confirm the importance of the visual modality for expression detection on the Aff-Wild2 database.
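To illustrate the kind of transformer-based audio-visual fusion described above, the following is a minimal PyTorch sketch, not the authors' implementation: each modality is projected into a shared embedding space, fused, passed through transformer encoder layers, and regressed to per-frame valence and arousal. Feature dimensions, layer counts, and all names (e.g. `AVTransformerFusion`) are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's code) of transformer-encoder
# fusion of frame-level audio and visual features for valence/arousal regression.
import torch
import torch.nn as nn

class AVTransformerFusion(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, d_model=256,
                 nhead=4, num_layers=2):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regress continuous valence and arousal for every frame.
        self.head = nn.Linear(d_model, 2)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, time, audio_dim); visual_feats: (batch, time, visual_dim)
        fused = self.audio_proj(audio_feats) + self.visual_proj(visual_feats)
        encoded = self.encoder(fused)           # self-attention over time
        return torch.tanh(self.head(encoded))   # valence/arousal in [-1, 1]

# Example: 8 clips, 100 time-synchronized audio/visual frames each.
model = AVTransformerFusion()
va = model(torch.randn(8, 100, 40), torch.randn(8, 100, 512))
print(va.shape)  # torch.Size([8, 100, 2])
```

A single-modality variant of this sketch (dropping one projection branch) corresponds to the unimodal models used for comparison in the ablation studies.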