Proactive human-robot interaction (HRI) allows receptionist robots to actively greet people and offer services based on vision, which has been found to improve acceptability and customer satisfaction. Existing approaches are based either on multi-stage decision processes or on end-to-end decision models. However, rule-based multi-stage approaches require painstaking expert effort and handle only a small number of pre-defined scenarios. Existing end-to-end models, on the other hand, are limited to very general greetings or a few behavior patterns (typically fewer than 10). To address these challenges, we propose a new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot Interaction (TFVT-HRI). The proposed framework first extracts visual tokens of relevant objects from an RGB camera. To ensure that the scenario is interpreted correctly, a transformer decision model then processes the visual tokens, which are augmented with temporal and spatial information. The model predicts the appropriate action to take in each scenario and identifies the right target. Our data is collected from an in-service receptionist robot in an office building and annotated by experts with appropriate proactive behavior. The action set includes 1000+ diverse patterns that combine language, emoji expressions, and body motions. To validate the framework, we compare our model with other state-of-the-art (SOTA) end-to-end models on both offline test sets and online user experiments in realistic office building environments. The results demonstrate that the decision model achieves SOTA performance in action triggering and selection, resulting in behavior that appears more human-like and intelligent than previous reactive reception policies.
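The pipeline described above (visual tokens, augmentation with temporal and spatial information, a transformer-style decision step that outputs an action and a target) can be illustrated with a minimal sketch. This is not the TFVT-HRI implementation: the abstract does not specify the architecture, so the sinusoidal frame encoding, the linear bounding-box projection `w_box`, the single-head attention with identity projections, the mean pooling, and the names `augment`, `self_attention`, `decide`, `w_action`, and `w_target` are all assumptions made for illustration.

```python
import math
import random


def temporal_encoding(frame_idx, dim):
    """Sinusoidal encoding of a frame index (assumed scheme; dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        angle = frame_idx / (10000 ** (i / dim))
        enc += [math.sin(angle), math.cos(angle)]
    return enc


def matvec(vec, mat):
    """Row vector (length k) times matrix (k x m), returned as a list of length m."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def augment(feature, box, frame_idx, w_box):
    """Add spatial (bounding box) and temporal (frame index) information to one token.

    feature: visual feature of a detected object (length d)
    box: normalized [x1, y1, x2, y2]
    w_box: hypothetical 4 x d projection of the box into feature space
    """
    spatial = matvec(box, w_box)
    temporal = temporal_encoding(frame_idx, len(feature))
    return [f + s + t for f, s, t in zip(feature, spatial, temporal)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, tokens)) for j in range(d)])
    return out


def decide(tokens, w_action, w_target):
    """Return (action index, target token index) from the attended tokens."""
    attended = self_attention(tokens)
    n, d = len(attended), len(attended[0])
    # Mean-pool all tokens to score actions; score each token as a candidate target.
    pooled = [sum(tok[j] for tok in attended) / n for j in range(d)]
    action_logits = matvec(pooled, w_action)
    target_logits = [sum(a * b for a, b in zip(tok, w_target)) for tok in attended]
    action = max(range(len(action_logits)), key=action_logits.__getitem__)
    target = max(range(n), key=target_logits.__getitem__)
    return action, target


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_actions = 8, 5  # toy sizes; the real action set has 1000+ patterns
    w_box = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
    w_action = [[rng.gauss(0, 1) for _ in range(n_actions)] for _ in range(dim)]
    w_target = [rng.gauss(0, 1) for _ in range(dim)]
    # Three detected objects in frame 0 and frame 1, with random stand-in features.
    tokens = [
        augment([rng.gauss(0, 1) for _ in range(dim)], [0.1, 0.2, 0.4, 0.9], frame, w_box)
        for frame in (0, 0, 1)
    ]
    print(decide(tokens, w_action, w_target))
```

The weights here are random, so the printed action and target carry no meaning; in a trained model they would be learned from the expert-annotated data. Scoring the action from pooled tokens and the target per token is one simple way to produce both outputs from a single attention pass.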