XiaoiceSing is a singing voice synthesis (SVS) system that aims to generate 48 kHz singing voices. However, the mel-spectrogram it generates is over-smoothed in the middle- and high-frequency regions, because the model has no dedicated design for modeling the details of these parts. In this paper, we propose XiaoiceSing2, which generates the details of the middle- and high-frequency parts to better construct the full-band mel-spectrogram. Specifically, to alleviate the over-smoothing problem, XiaoiceSing2 adopts a generative adversarial network (GAN) that consists of a FastSpeech-based generator and a multi-band discriminator. We improve the feed-forward Transformer (FFT) block by adding multiple residual convolutional blocks in parallel with the self-attention block to balance local and global features. The multi-band discriminator contains three sub-discriminators, responsible for the low-, middle-, and high-frequency parts of the mel-spectrogram, respectively. Each sub-discriminator is composed of several segment discriminators (SD) and detail discriminators (DD) to distinguish the audio from different aspects. An experiment on our internal 48 kHz singing voice dataset shows that XiaoiceSing2 significantly improves the quality of the singing voice over XiaoiceSing.
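To make the multi-band idea concrete, the following minimal sketch shows how a mel-spectrogram could be sliced along the frequency axis so that each of the three sub-discriminators sees only its own band. It is an illustration under stated assumptions, not the authors' implementation: the function name `split_mel_bands`, the boundary parameters `low_end` and `mid_end`, the toy 12-bin input, and the choice of non-overlapping bands are all hypothetical, since the abstract does not specify where the bands are cut.

```python
from typing import Dict, List

# A mel-spectrogram stored as [n_frames][n_mel_bins].
Mel = List[List[float]]


def split_mel_bands(mel: Mel, low_end: int, mid_end: int) -> Dict[str, Mel]:
    """Slice every frame of a mel-spectrogram into low, middle and high bands.

    low_end and mid_end are hypothetical bin indices chosen for illustration;
    the abstract does not say where the bands are cut or whether they overlap.
    """
    if not mel:
        raise ValueError("mel must contain at least one frame")
    n_bins = len(mel[0])
    if not 0 < low_end < mid_end < n_bins:
        raise ValueError("need 0 < low_end < mid_end < n_mel_bins")
    return {
        "low": [frame[:low_end] for frame in mel],
        "middle": [frame[low_end:mid_end] for frame in mel],
        "high": [frame[mid_end:] for frame in mel],
    }


# Toy input: 4 frames of 12 mel bins, each value equal to its bin index.
toy_mel = [[float(b) for b in range(12)] for _ in range(4)]
bands = split_mel_bands(toy_mel, low_end=4, mid_end=8)

# (n_frames, n_bins) of each band, as it would be passed to one sub-discriminator.
band_shapes = {name: (len(band), len(band[0])) for name, band in bands.items()}
```

In a real system each band would be fed to its own sub-discriminator network; here the slices are simply returned so that the band partition itself can be inspected.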