There is a growing need for social robots and intelligent agents that can effectively interact with and support users. For these interactions to be seamless, an agent must analyse social scenes and behavioural cues from its own (the robot's) perspective. Works that model human-agent interactions in social situations are few, and the existing ones are either too computationally intensive to deploy in real time or perform poorly in real-world scenarios where only limited information is available. We propose a knowledge distillation framework that models social interactions through various multimodal cues, yet remains robust to incomplete and noisy information at inference time. We train a teacher model on multimodal input (body, face, and hand gestures, gaze, and raw images) and transfer its knowledge to a student model that relies solely on body pose. Extensive experiments on two publicly available human-robot interaction datasets demonstrate that our student model achieves an average accuracy gain of 14.75% over competitive baselines on multiple downstream social understanding tasks, even with up to 51% of its input corrupted. The student model is also highly efficient: it has fewer than 1% of the teacher model's parameters and 11.9% of its latency. Our code and related data are available at github.com/biantongfei/SocialEgoMobile.
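As a minimal sketch of the teacher-student transfer described above (not the paper's actual training code), soft-label knowledge distillation matches the student's temperature-softened predictions to the teacher's. The temperature `T`, the class count, and the example logits below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) between temperature-softened distributions,
    # scaled by T^2 as in standard soft-label distillation.
    p = softmax(teacher_logits, T)  # soft targets from the multimodal teacher
    q = softmax(student_logits, T)  # predictions from the pose-only student
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl.mean()

# Hypothetical logits for a 3-class social-understanding task.
teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[2.5, 1.5, 1.0]])
loss = distillation_loss(student, teacher)
```

In practice this distillation term would be combined with a supervised loss on ground-truth labels; the loss is zero only when the student reproduces the teacher's softened distribution exactly.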