We present a fully on-device, streaming Speech-To-Speech conversion model that normalizes given input speech directly to synthesized output speech (a.k.a. Parrotron). Deploying such an end-to-end model locally on mobile devices poses significant challenges in terms of memory footprint and computation requirements. In this paper, we present a streaming-based approach that achieves an acceptable delay with minimal loss in speech conversion quality, compared to a reference state-of-the-art non-streaming approach. Our method first runs the encoder in streaming mode, in real time, while the speaker is speaking. Then, as soon as the speaker stops speaking, we run the spectrogram decoder in streaming mode alongside a streaming vocoder to generate output speech in real time. To achieve an acceptable delay-quality trade-off, we propose a novel hybrid approach for look-ahead in the encoder that combines a look-ahead feature stacker with look-ahead self-attention. We also compare the model under int4 quantization-aware training and int8 post-training quantization, and show that our streaming approach is 2x faster than real time on a Pixel 4 CPU.
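To make the hybrid look-ahead concrete, below is a minimal NumPy sketch of the two mechanisms it combines: a feature stacker that appends a few future frames to each input frame, and a self-attention mask that lets each position attend to a bounded window of future positions. This is not the paper's implementation; the function names and the parameters `right_frames` and `right_context` are illustrative assumptions, and projections and multi-head structure are omitted for brevity.

```python
import numpy as np

def stack_lookahead_features(frames: np.ndarray, right_frames: int) -> np.ndarray:
    """Stack each frame with `right_frames` future frames (edge-padded),
    giving the encoder a small fixed look-ahead at the feature level.
    frames: (T, D) -> returns (T, D * (right_frames + 1))."""
    T, _ = frames.shape
    padded = np.concatenate(
        [frames, np.repeat(frames[-1:], right_frames, axis=0)], axis=0)
    return np.concatenate(
        [padded[i:i + T] for i in range(right_frames + 1)], axis=1)

def lookahead_attention_mask(T: int, right_context: int) -> np.ndarray:
    """Additive mask allowing each position to attend to all past positions
    and up to `right_context` future positions (0 = attend, -inf = blocked)."""
    idx = np.arange(T)
    allowed = idx[None, :] <= idx[:, None] + right_context
    return np.where(allowed, 0.0, -np.inf)

def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention under the look-ahead
    mask; `x` serves as queries, keys, and values (no learned projections)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# Example: 100 frames of 80-dim features, 2 frames of stacker look-ahead,
# 3 frames of attention right context per layer.
feats = np.random.randn(100, 80)
stacked = stack_lookahead_features(feats, right_frames=2)      # (100, 240)
mask = lookahead_attention_mask(stacked.shape[0], right_context=3)
out = masked_self_attention(stacked, mask)
```

Under these assumptions, the total algorithmic look-ahead is the stacker's `right_frames` plus the attention `right_context` accumulated across encoder layers, which is the knob behind the delay-quality trade-off described above.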