Text to speech (TTS) has been broadly used to synthesize natural and intelligible speech in different scenarios. Deploying TTS on various end devices such as mobile phones or embedded devices requires extremely small memory usage and inference latency. While non-autoregressive TTS models such as FastSpeech have achieved significantly faster inference speed than autoregressive models, their model size and inference latency are still too large for deployment on resource-constrained devices. In this paper, we propose LightSpeech, which leverages neural architecture search~(NAS) to automatically design more lightweight and efficient models based on FastSpeech. We first profile the components of the current FastSpeech model and carefully design a novel search space containing various lightweight and potentially effective architectures. Then NAS is utilized to automatically discover well-performing architectures within the search space. Experiments show that the model discovered by our method achieves a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality. Audio demos are provided at https://speechresearch.github.io/lightspeech.
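At a very high level, NAS over a discrete search space can be sketched as sampling candidate architectures and keeping the best one under a deployment budget. The sketch below uses plain random search with made-up operation names and parameter costs purely for illustration; the paper's actual search space, scoring (trained voice quality), and search algorithm are not reproduced here.

```python
import random

# Hypothetical search space: candidate operations per component, loosely
# inspired by the idea of swapping FastSpeech blocks for lighter variants.
# All names and numbers below are illustrative assumptions, not the paper's.
SEARCH_SPACE = {
    "encoder_op": ["sep_conv_k5", "sep_conv_k9", "transformer"],
    "decoder_op": ["sep_conv_k5", "sep_conv_k9", "transformer"],
    "hidden_size": [128, 192, 256],
}

# Illustrative relative parameter cost per operation (made-up numbers).
PARAM_COST = {"sep_conv_k5": 1.0, "sep_conv_k9": 1.4, "transformer": 4.0}

def model_size(arch):
    """Proxy for model size: operation costs scaled by hidden size."""
    scale = arch["hidden_size"] / 128
    return (PARAM_COST[arch["encoder_op"]] + PARAM_COST[arch["decoder_op"]]) * scale

def random_search(n_samples=50, budget=3.0, seed=0):
    """Sample architectures; keep the largest model that fits the budget.

    A real NAS run would rank candidates by trained voice quality; this
    sketch simply prefers capacity (size) subject to the size budget.
    Returns (arch, size) or None if nothing fit.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        arch = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        size = model_size(arch)
        if size <= budget and (best is None or size > best[1]):
            best = (arch, size)
    return best
```

In practice, more sample-efficient strategies (e.g. evolutionary search or accuracy predictors) replace the random sampler, but the structure — enumerate candidates from a hand-designed space, score them, keep the best under a resource constraint — is the same.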