Voice conversion models have developed for decades, and current mainstream research focuses on non-streaming voice conversion. However, streaming voice conversion is more suitable for practical application scenarios than non-streaming voice conversion. In this paper, we propose a streaming any-to-many voice conversion based on fully non-autoregressive model, which includes a streaming transformer based acoustic model and a streaming vocoder. Streaming transformer based acoustic model is composed of a pre-trained encoder from streaming end-to-end based automatic speech recognition model and a decoder modified on FastSpeech blocks. Streaming vocoder is designed for streaming task with pseudo quadrature mirror filter bank and causal convolution. Experimental results show that the proposed method achieves significant performance both in latency and conversion quality and can be real-time on CPU and GPU.
翻译:声音转换模型已经开发了几十年,而目前的主流研究侧重于非流音语音转换。 然而,流声转换比非流音转换更适合实际应用情景。 在本文中,我们提议基于完全非反向模式,包括流变变变变变变变变变变换模型和流变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变。 流变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变的变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变