Nowadays more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and not flexible enough to be deployed on the wide variety of edge devices, whose computational capacities vary just as widely. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that achieve optimal performance-efficiency trade-offs for different edge devices. FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models. It produces high-quality audio and supports streaming during inference while remaining highly computationally efficient. Our experiments show that FBWave can achieve audio quality similar to WaveRNN while using 40x fewer MACs. More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality. Audio demos are available at https://bichenwu09.github.io/vocoder_demos.