The capability to understand affect is essential for social robots to autonomously interact with a group of users in an intuitive and reciprocal way. However, the challenge of multi-person affect understanding lies not only in accurately perceiving each user's affective state (e.g., engagement) but also in recognizing the affective interplay between members (e.g., joint engagement), which manifests as complex yet subtle nonverbal exchanges between them. Here we present a novel hybrid framework for identifying a parent-child dyad's joint engagement by combining deep learning models with various video augmentation techniques. Using a dataset of parent-child dyads reading storybooks together with a social robot at home, we first train RGB frame- and skeleton-based joint engagement recognition models on datasets augmented with four video augmentation techniques (General Aug, DeepFake, CutOut, and Mixed) to improve joint engagement classification performance. Second, we present experimental results on using the trained models in the robot-parent-child interaction context. Third, we introduce a behavior-based metric for evaluating the models' learned representations to investigate their interpretability when recognizing joint engagement. This work serves as a first step toward fully unlocking the potential of end-to-end video understanding models pre-trained on large public datasets and augmented with data augmentation and visualization techniques for affect recognition in multi-person human-robot interaction in the wild.
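As a concrete illustration of one of the augmentation techniques named above, the following is a minimal sketch of a CutOut-style video augmentation in PyTorch: a single randomly placed square region is zeroed out consistently across all frames of a clip. The function name, tensor layout, and patch size are assumptions for illustration only, not the paper's actual implementation or settings.

```python
# Sketch of CutOut applied to a video clip tensor of shape (T, C, H, W).
# Patch size and placement here are illustrative assumptions.
import random
import torch


def cutout_clip(clip: torch.Tensor, patch_size: int = 56) -> torch.Tensor:
    """Zero out one random square patch at the same location in every frame."""
    _, _, h, w = clip.shape
    # Pick a top-left corner so the patch stays fully inside the frame.
    top = random.randint(0, max(0, h - patch_size))
    left = random.randint(0, max(0, w - patch_size))
    out = clip.clone()
    out[:, :, top:top + patch_size, left:left + patch_size] = 0.0
    return out


if __name__ == "__main__":
    # Dummy clip: 16 RGB frames at 224x224 resolution.
    clip = torch.rand(16, 3, 224, 224)
    augmented = cutout_clip(clip)
    print(augmented.shape)  # torch.Size([16, 3, 224, 224])
```

Masking the same region across frames (rather than re-sampling per frame) preserves temporal consistency, which is one plausible way such an occlusion-style augmentation could be applied to video data.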