We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound sources. An onomatopoeia is a word that imitates a sound, i.e., a textual representation of a sound. From this perspective, onoma-to-wave has been proposed to synthesize environmental sounds from desired onomatopoeia texts. Onomatopoeias also have another representation: the visual text of sounds used in comics, advertisements, and virtual reality. A visual onomatopoeia (the visual text of an onomatopoeia) carries rich information absent from plain text, such as the sound duration suggested by the elongation of the rendered characters, so using this representation is expected to enable the synthesis of diverse sounds. We therefore propose visual onoma-to-wave, a method for environmental sound synthesis from visual onomatopoeias. The method can transfer visual concepts of the visual text and the sound-source image to the synthesized sound. We also propose a data augmentation method that focuses on the repetition of onomatopoeias to improve the performance of our method. Experimental evaluations show that our methods can synthesize diverse environmental sounds from visual texts and sound-source images.