This research project investigates the application of deep learning to timbre transfer, in which the timbre of a source audio signal is converted to that of a target audio signal with minimal loss in quality. The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio. It is applied to the Flickr 8k Audio dataset to transfer vocal timbre between speakers and to the URMP dataset to transfer musical timbre between instruments. Variations of the adopted approach are also trained, and their generalised performance is compared using the metrics SSIM (Structural Similarity Index) and FAD (Fréchet Audio Distance). It was found that a many-to-many approach outperforms a one-to-one approach in terms of reconstructive capabilities, and that a basic residual block design is more suitable than a bottleneck design for enriching content information in the latent space. It was also found that whether the cyclic loss follows a variational autoencoder or a vanilla autoencoder formulation has no significant impact on the reconstructive and adversarial translation aspects of the model.