We present Perceive-Represent-Generate (PRG), a novel three-stage framework that maps perceptual information of different modalities (e.g., visual or auditory), corresponding to a sequence of instructions, to an adequate sequence of movements to be executed by a robot. In the first stage, we perceive and pre-process the given inputs, isolating individual commands from the complete instruction provided by a human user. In the second stage, we encode the individual commands into a multimodal latent space, employing a deep generative model. Finally, in the third stage, we convert the multimodal latent values into individual trajectories and combine them into a single dynamic movement primitive, allowing its execution on a robotic platform. We evaluate our pipeline in the context of a novel robotic handwriting task, where the robot receives a word as input through different perceptual modalities (e.g., image, sound) and generates the corresponding motion trajectory to write it, producing coherent and readable handwritten words.
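To make the data flow across the three stages concrete, the following Python sketch outlines one possible interface for the pipeline. All names (Command, perceive, represent, generate, prg_pipeline) and the placeholder bodies are illustrative assumptions, not the actual implementation described in the paper.

```python
# Minimal, hypothetical sketch of a Perceive-Represent-Generate pipeline.
# Function bodies are placeholders standing in for the paper's components
# (instruction segmentation, deep generative encoder, DMP-based generation).
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Command:
    modality: str  # e.g. "image" or "sound"
    data: bytes    # raw perceptual input for one command


def perceive(instruction: Sequence[Command]) -> List[Command]:
    """Stage 1: pre-process the inputs and isolate individual commands."""
    return list(instruction)  # placeholder segmentation


def represent(commands: List[Command]) -> List[List[float]]:
    """Stage 2: encode each command into a multimodal latent vector
    (in the paper, via a deep generative model)."""
    return [[0.0] * 8 for _ in commands]  # placeholder latent codes


def generate(latents: List[List[float]]) -> List[List[float]]:
    """Stage 3: decode latents into per-command trajectories and combine
    them into a single movement for execution on the robot."""
    trajectory: List[List[float]] = []
    for z in latents:
        trajectory.append(z)  # placeholder per-command trajectory segment
    return trajectory


def prg_pipeline(instruction: Sequence[Command]) -> List[List[float]]:
    """Chain the three stages: perceive -> represent -> generate."""
    return generate(represent(perceive(instruction)))
```

In this sketch the stages are pure functions chained in sequence, mirroring the description above: segmentation of the instruction, encoding into a shared latent space, and decoding into a single executable trajectory.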