We present a fully on-device, streaming Speech-To-Speech (STS) conversion model that normalizes given input speech directly to synthesized output speech (a.k.a. Parrotron). Deploying such an end-to-end model locally on mobile devices poses significant challenges in terms of memory footprint and computation requirements. In this paper, we present a streaming-based approach that produces an acceptable delay, with minimal loss in speech conversion quality, when compared to a non-streaming server-based approach. Our approach first runs the encoder in streaming mode in real time while the speaker is speaking. Then, as soon as the speaker stops speaking, we run the spectrogram decoder in streaming mode alongside a streaming vocoder to generate output speech in real time. To achieve an acceptable delay-quality trade-off, we study a novel hybrid look-ahead approach in the encoder that combines a look-ahead feature stacker with look-ahead self-attention. We also compare a model trained with int4 quantization-aware training against one using int8 post-training quantization, and show that our streaming approach is 2x faster than real time on the Pixel4 CPU.
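To make the hybrid look-ahead concrete, the following is a minimal sketch, assuming frame-level encoder features of shape [T, D]; the stacker width k, the attention look-ahead r, the single-head attention without learned projections, and all function names are illustrative assumptions rather than the paper's implementation. It shows how a feature stacker contributes k frames of future context while a masked self-attention contributes r more.

    import numpy as np

    def lookahead_stack(x, k):
        # Concatenate each frame with its next k frames (zero-padded at the end),
        # giving every position k frames of look-ahead.
        T, D = x.shape
        padded = np.concatenate([x, np.zeros((k, D))], axis=0)
        return np.concatenate([padded[i:i + T] for i in range(k + 1)], axis=1)

    def lookahead_attention_mask(T, r):
        # Boolean mask: position t may attend to positions j <= t + r.
        idx = np.arange(T)
        return idx[None, :] <= idx[:, None] + r

    def self_attention(x, mask):
        # Simplified single-head scaled dot-product self-attention
        # (x serves as queries, keys, and values; no learned projections).
        d = x.shape[-1]
        scores = x @ x.T / np.sqrt(d)
        scores = np.where(mask, scores, -1e9)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ x

    T, D, k, r = 8, 4, 2, 1
    feats = np.random.randn(T, D)
    stacked = lookahead_stack(feats, k)   # k frames of look-ahead from stacking
    out = self_attention(stacked, lookahead_attention_mask(T, r))  # +r from attention
    # Under these assumptions, the effective look-ahead is k + r frames.

Under these assumptions, the stacker and the attention mask provide two independent knobs on future context, so latency can be traded against conversion quality by adjusting k and r separately.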
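As a hedged illustration of the int8 post-training quantization path, the sketch below assumes a TensorFlow SavedModel exported at a hypothetical saved_model_dir and a hypothetical representative_batches calibration generator; these names and input shapes are not from the paper, and the int4 quantization-aware training variant is not shown since it relies on training-time tooling.

    import numpy as np
    import tensorflow as tf

    saved_model_dir = "/path/to/parrotron_saved_model"  # hypothetical export path

    def representative_batches():
        # Hypothetical calibration data: a few batches of log-mel input frames.
        for _ in range(16):
            yield [np.random.randn(1, 100, 80).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]        # enable quantization
    converter.representative_dataset = representative_batches   # int8 calibration
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    tflite_model = converter.convert()
    open("parrotron_int8.tflite", "wb").write(tflite_model)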