In this paper, we propose a new framework for environmental sound synthesis using onomatopoeic words and sound event labels. The conventional method of environmental sound synthesis, which uses only sound event labels, cannot finely control the time-frequency structural features of synthesized sounds, such as sound duration, timbre, and pitch. Environmental sounds can also be expressed in ways other than sound event labels, for example with onomatopoeic words. An onomatopoeic word, which is a character sequence that phonetically imitates a sound, has been shown to be effective for describing the phonetic features of sounds. We believe that synthesizing environmental sounds from onomatopoeic words will make it possible to control these fine time-frequency structural features. We thus propose a method of environmental sound synthesis from onomatopoeic words that uses a sequence-to-sequence framework to convert an onomatopoeic word into an environmental sound. We also propose a method of environmental sound synthesis that uses both onomatopoeic words and sound event labels to control the fine time-frequency structure and frequency property of synthesized sounds. Our subjective experiments show that the proposed method achieves the same level of sound quality as the conventional method using WaveNet. Moreover, our methods outperform the conventional method in terms of how well the synthesized sounds express the given onomatopoeic words.