If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have made progress in generating visuals from text inputs, the translation of sound into images remains largely unexplored. Sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user, and to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, which we call robot synesthesia. For general sound, we encode the simulated paintings and input sounds into the same latent space. For speech, we decouple the signal into its transcribed text and its tone: we use the text to control the content, and we estimate emotions from the tone to guide the mood of the painting. Our approach has been fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities, such as text and style. In two surveys, participants correctly guessed the emotion or natural sound used to generate a given painting at a rate more than twice that of random chance. We discuss results on sound-guided image manipulation and music-guided paintings qualitatively.
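The shared latent space mentioned above can be illustrated with a minimal sketch: two hypothetical linear encoders project raw audio features and simulated-painting features into a common embedding space, and a cosine-similarity score measures how well a candidate painting matches the input sound. The dimensions, the random projections, and the function names here are illustrative assumptions, not the paper's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical projections: 128-d raw audio features and 256-d painting
# features, both mapped into a shared 64-d latent space.
audio_proj = rng.normal(size=(64, 128))
paint_proj = rng.normal(size=(64, 256))

def embed(features, projection):
    """Project raw features into the shared latent space, L2-normalized."""
    z = projection @ features
    return z / np.linalg.norm(z)

def alignment_score(audio_features, painting_features):
    """Cosine similarity between the two embeddings; a higher score means
    the simulated painting better matches the input sound, so an optimizer
    could adjust brush strokes to maximize it."""
    za = embed(audio_features, audio_proj)
    zp = embed(painting_features, paint_proj)
    return float(za @ zp)

score = alignment_score(rng.normal(size=128), rng.normal(size=256))
assert -1.0 <= score <= 1.0  # cosine similarity is bounded
```

In practice, a pretrained audio-visual embedding model would replace the random projections, and the painting planner would optimize strokes to maximize this alignment score.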