Humans effectively use audio signals, in the form of spoken language or verbal reactions, when teaching new skills or tasks to other humans. While demonstrations allow humans to teach robots in a natural way, learning from trajectories alone does not leverage other available modalities, including audio from human teachers. To effectively utilize audio cues accompanying human demonstrations, it is first important to understand what kind of information is present in and conveyed by such cues. This work characterizes audio from human teachers demonstrating multi-step manipulation tasks to a situated Sawyer robot using three feature types: (1) duration of speech used, (2) expressiveness in speech, or prosody, and (3) semantic content of speech. We analyze these features along four dimensions and find that teachers convey similar semantic concepts via spoken words across different conditions of (1) demonstration types, (2) audio usage instructions, (3) subtasks, and (4) errors during demonstrations. However, differentiating properties of speech in terms of duration and expressiveness are present along the four dimensions, highlighting that human audio carries rich information that is potentially beneficial for advancing robot learning from demonstration methods.