This paper presents a novel speech phase prediction model that uses neural networks to predict wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture consists of two parallel linear convolutional layers and a phase calculation formula. It imitates the process of calculating phase spectra from the real and imaginary parts of complex spectra, and strictly restricts the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and the natural ones. These losses apply an anti-wrapping function to the instantaneous phase error, the group delay error, and the instantaneous angular frequency error. Experimental results show that the proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based methods in terms of both reconstructed speech quality and generation speed.
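The two central ideas above, the phase calculation step and the anti-wrapping losses, can be illustrated with a minimal pure-Python sketch. The abstract does not give formulas, so several things here are assumptions: the anti-wrapping function is taken to be |x − 2π·round(x / 2π)|, each loss is taken to be a mean absolute anti-wrapped error, group delay and instantaneous angular frequency are approximated by first differences along the frequency and time axes, and all function names are hypothetical. The two-argument arctangent stands in for the phase calculation formula, with its inputs playing the role of the outputs of the two parallel linear layers.

```python
import math

def phase_from_parts(pseudo_real, pseudo_imag):
    """Map a (pseudo-real, pseudo-imaginary) pair to a phase in [-pi, pi]."""
    return math.atan2(pseudo_imag, pseudo_real)

def anti_wrap(x):
    """Assumed anti-wrapping function: distance from x to the nearest multiple of 2*pi."""
    return abs(x - 2.0 * math.pi * round(x / (2.0 * math.pi)))

def diff_freq(phase):
    """First difference along the frequency axis (group delay, up to sign)."""
    return [[row[k + 1] - row[k] for k in range(len(row) - 1)] for row in phase]

def diff_time(phase):
    """First difference along the time axis (instantaneous angular frequency)."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(phase, phase[1:])]

def mean_anti_wrapped_error(pred, target):
    """Mean of the anti-wrapped element-wise error between two 2-D lists."""
    errs = [anti_wrap(p - t)
            for pred_row, target_row in zip(pred, target)
            for p, t in zip(pred_row, target_row)]
    return sum(errs) / len(errs)

def anti_wrapping_losses(pred, target):
    """Return (instantaneous phase, group delay, angular frequency) losses.

    Both inputs are indexed as phase[frame][frequency_bin].
    """
    ip = mean_anti_wrapped_error(pred, target)
    gd = mean_anti_wrapped_error(diff_freq(pred), diff_freq(target))
    iaf = mean_anti_wrapped_error(diff_time(pred), diff_time(target))
    return ip, gd, iaf
```

Under this sketch, a predicted phase of 3.1 rad against a target of −3.1 rad yields an error of about 0.083 rad rather than 6.2 rad, which is the kind of error expansion the anti-wrapping losses are meant to avoid.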