Human affect recognition is an important factor in human-computer interaction. However, methods developed on in-the-wild data are not yet accurate enough for practical use. In this paper, we introduce an affect recognition method focusing on facial expression (EXP) recognition and valence-arousal (VA) estimation, which we submitted to the Affective Behavior Analysis in-the-wild (ABAW) 2021 Competition. We hypothesized that when facial expressions are annotated from a video, they are judged not only from features common to all people but also from relative changes over the time series of each individual. Therefore, after learning common features for each frame, we combined these common features with features standardized within each video and built a facial expression estimation model and a valence-arousal model on the resulting time-series data. Furthermore, these features were learned from multi-modal data, including image features, action units (AU), head pose, and gaze. On the validation set, our model achieved a facial expression score of 0.546. These validation results show that our proposed framework effectively improves estimation accuracy and robustness.
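To make the per-video standardization step concrete, the following is a minimal sketch, assuming frame-level multi-modal features are stored as a NumPy array per video; the function names, feature dimensions, and the z-score formulation are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def standardize_per_video(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score frame-level features within a single video.

    features: (num_frames, feature_dim) array of per-frame features.
    Returns an array of the same shape in which each feature dimension
    has zero mean and unit variance within this video, so that the
    values capture relative changes over time for the individual.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)  # eps avoids division by zero

def build_model_input(common: np.ndarray) -> np.ndarray:
    """Concatenate the common (subject-independent) per-frame features
    with their per-video standardized counterpart along the feature
    axis, producing the input for a downstream time-series model."""
    standardized = standardize_per_video(common)
    return np.concatenate([common, standardized], axis=-1)

# Hypothetical usage: 300 frames of one video, each with a 512-d
# multi-modal feature vector (e.g., image features, AUs, head pose,
# and gaze concatenated upstream).
frames = np.random.randn(300, 512).astype(np.float32)
model_input = build_model_input(frames)  # shape: (300, 1024)
```

Standardizing within each video, rather than over the whole dataset, is what lets the downstream time-series model see an individual's deviations from their own baseline expression alongside the subject-independent features.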