We propose the first method to adaptively modify the duration of a given speech signal. Our approach uses a Bayesian framework to define a latent attention map that links frames of the input and target utterances. We train a masked convolutional encoder-decoder network to produce this attention map via a stochastic version of the mean absolute error loss; the model also predicts the length of the target speech signal from the encoder embeddings, and this predicted length determines the number of decoder steps. During inference, we generate the attention map as a proxy for the similarity matrix between the given input speech and the unknown target speech signal, and from this similarity matrix we compute a warping path that aligns the two signals. Our experiments on voice conversion and emotion conversion tasks demonstrate that this adaptive framework produces results comparable to dynamic time warping, which relies on a known target signal. We also show that the quality of the generated speech is on par with state-of-the-art vocoders.
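The inference step above extracts an alignment from a frame-by-frame similarity matrix. As a minimal illustrative sketch (not the paper's implementation), the warping path can be obtained from such a matrix with the classic dynamic-programming recurrence used in dynamic time warping; here the similarity is converted to a cost and the minimum-cost monotonic path is backtracked:

```python
import numpy as np

def warping_path(similarity):
    """Compute a monotonic alignment path between two sequences from
    their frame-by-frame similarity matrix (higher = more similar),
    using dynamic-time-warping style dynamic programming."""
    cost = 1.0 - similarity  # turn similarity into a cost
    n, m = cost.shape
    acc = np.full((n, m), np.inf)  # accumulated cost
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = cost[i, j] + prev
    # Backtrack from the last cell to recover the optimal path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    return path[::-1]

# Toy example: an identity similarity matrix yields the diagonal path,
# i.e. frame k of one signal aligns with frame k of the other.
print(warping_path(np.eye(4)))  # → [(0, 0), (1, 1), (2, 2), (3, 3)]
```

In the proposed method, the learned attention map stands in for this similarity matrix, so the path can be computed without access to the true target signal.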