We present a neural vocoder designed with low-powered Alternative and Augmentative Communication devices in mind. By combining elements of successful modern vocoders with established ideas from an older generation of technology, our system produces high-quality synthetic speech at 48 kHz on devices where neural vocoders are otherwise prohibitively complex. The system is trained adversarially using differentiable pitch-synchronous overlap-add, and it reduces complexity by relying on a pitch-synchronous inverse short-time Fourier transform (ISTFT) to generate speech samples. Our system achieves quality comparable to a strong baseline (HiFi-GAN) while using only a fraction of the compute. We present the results of a perceptual evaluation as well as an analysis of system complexity.
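To make the synthesis step concrete, the following is a minimal sketch of pitch-synchronous overlap-add driven by an inverse DFT. It is an illustration under stated assumptions, not the system described above: the function names (`inverse_rdft`, `hann`, `psola_istft`), the periodic Hann window, the fixed frame length, and the use of hand-supplied spectra in place of network outputs are all hypothetical choices made for this example.

```python
import cmath
import math


def inverse_rdft(spectrum, n):
    """Naive inverse real DFT: n real samples from n // 2 + 1 complex bins."""
    # Rebuild the full conjugate-symmetric spectrum from the non-negative bins.
    full = list(spectrum) + [
        spectrum[n - k].conjugate() for k in range(n // 2 + 1, n)
    ]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def hann(n):
    """Periodic Hann window (assumed here; copies spaced n / 2 apart sum to one)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def psola_istft(frames, pitch_marks, frame_len, out_len):
    """Overlap-add one windowed inverse-DFT frame centred on each pitch mark.

    frames      -- one spectrum (frame_len // 2 + 1 complex bins) per pitch mark;
                   in a vocoder these would come from the network (assumption)
    pitch_marks -- sample indices of the frame centres
    """
    window = hann(frame_len)
    out = [0.0] * out_len
    for spectrum, mark in zip(frames, pitch_marks):
        frame = inverse_rdft(spectrum, frame_len)
        start = mark - frame_len // 2
        for i, (sample, weight) in enumerate(zip(frame, window)):
            j = start + i
            if 0 <= j < out_len:  # drop samples that fall outside the output
                out[j] += sample * weight
    return out


# Usage: four DC-only frames placed half a frame apart.
frame_len = 8
dc_frame = [complex(frame_len)] + [0j] * (frame_len // 2)
signal = psola_istft([dc_frame] * 4, [4, 8, 12, 16], frame_len, 20)
```

In this sketch the cost per pitch period is a single inverse transform plus a windowed add, which is the kind of saving the abstract attributes to ISTFT-based generation. A practical implementation would use an FFT rather than the quadratic-time transform shown here, and pitch marks would follow the estimated fundamental frequency rather than a fixed spacing.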