The rise of deep learning algorithms has led many researchers to move away from classic signal-processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation, and the choice of architecture is tightly coupled to the audio representation. A sound's raw waveform can be too dense and rich for deep learning models to handle efficiently, and this complexity increases training time and computational cost. Moreover, the waveform does not represent sound in the way it is perceived. Therefore, in many cases, the raw audio is transformed into a compressed and more meaningful form by downsampling, feature extraction, or the adoption of a higher-level representation of the waveform. Furthermore, depending on the chosen representation, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. It also presents the most significant methods for developing and evaluating a sound synthesis architecture with deep learning models, always in relation to the audio representation.
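As a minimal illustration of such a transformation, a raw waveform can be converted into a log-magnitude spectrogram via a short-time Fourier transform, which compresses the signal's dynamic range and exposes its frequency content. This is only a sketch using NumPy; the frame size and hop length below are arbitrary illustrative choices, not values taken from any particular model:

```python
import numpy as np

def log_spectrogram(x, n_fft=1024, hop=256):
    """Log-magnitude STFT of a mono waveform (frames x frequency bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame
    return np.log1p(magnitude)  # log compression of the dynamic range

# One second of a 440 Hz tone as a stand-in for real audio.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)

S = log_spectrogram(x)
print(S.shape)  # (n_frames, n_fft // 2 + 1)
```

The resulting 2-D time-frequency array is far smaller than the raw sample sequence and closer to how sound is perceived, which is one reason such representations are popular inputs and outputs for the models surveyed here.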