Content mismatch usually occurs when data from one modality is translated to another, e.g. language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content of the two modalities is perfectly matched, which makes it difficult to locate such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, named mismatch localization variational autoencoder (ML-VAE), that decomposes the generative process of the speech into hierarchically structured latent variables that indicate the relationship between the two modalities. Training such a model is very challenging because of the discrete latent variables and their complex dependencies. We propose a novel and effective training procedure that alternates between estimating hard assignments of the discrete latent variables over a specifically designed lattice and updating the parameters of the neural networks. Our experimental results show that ML-VAE successfully locates the mismatch between text and speech without requiring human annotations for model training.
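The alternating scheme described above (hard assignments over a lattice, then a parameter update) can be illustrated with a toy sketch. This is not the ML-VAE training procedure: it assumes scalar "frames", a plain left-to-right lattice in which each frame either stays in the current state or advances to the next, a squared-error cost in place of a neural likelihood, and per-state means in place of network parameters. The names `viterbi_align` and `hard_em` are hypothetical.

```python
def viterbi_align(frames, means):
    """Find the cheapest monotonic frame-to-state path over a left-to-right lattice.

    Assumption: the path starts in state 0, ends in the last state, and moves
    forward by at most one state per frame (so len(frames) >= len(means)).
    """
    T, K = len(frames), len(means)
    INF = float("inf")
    cost = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    cost[0][0] = (frames[0] - means[0]) ** 2
    for t in range(1, T):
        for k in range(K):
            stay = cost[t - 1][k]
            advance = cost[t - 1][k - 1] if k > 0 else INF
            if advance < stay:
                best, back[t][k] = advance, k - 1
            else:
                best, back[t][k] = stay, k
            if best < INF:  # skip lattice nodes that cannot be reached
                cost[t][k] = best + (frames[t] - means[k]) ** 2
    # Trace the best path back from the final state.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, cost[T - 1][K - 1]


def hard_em(frames, init_means, n_iters=10):
    """Alternate between hard assignment (Viterbi) and parameter updates."""
    means = list(init_means)
    for _ in range(n_iters):
        # Step 1: hard-assign every frame to a state along the best lattice path.
        path, _ = viterbi_align(frames, means)
        # Step 2: refit each state's parameter (here, its mean) to its frames.
        for k in range(len(means)):
            assigned = [x for x, s in zip(frames, path) if s == k]
            if assigned:
                means[k] = sum(assigned) / len(assigned)
    path, total_cost = viterbi_align(frames, means)
    return path, means, total_cost


# Toy data: three plateaus, standing in for three units of a target sequence.
frames = [0.1, 0.0, 0.2, 5.1, 4.9, 9.8, 10.1, 10.0]
path, means, total_cost = hard_em(frames, init_means=[1.0, 4.0, 8.0])
```

In this sketch the hard-assignment step plays the role of inferring discrete latent variables, and the mean update plays the role of the neural-network parameter update; a real model would replace both the cost and the update with learned components.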