Content mismatch usually occurs when data from one modality is translated into another, e.g., language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content in the two modalities is perfectly matched, making it difficult to locate such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, focusing on speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, dubbed the mismatch localization variational autoencoder (ML-VAE), which decomposes the generative process of speech into hierarchically structured latent variables that indicate the relationship between the two modalities. Training such a model is challenging because of the discrete latent variables with complex dependencies involved. To address this challenge, we propose a novel and effective training procedure that alternates between estimating hard assignments of the discrete latent variables over a specially designed mismatch localization finite-state acceptor (ML-FSA) and updating the parameters of the neural networks. In this work, we focus on the mismatch localization problem for speech and text, and our experimental results show that ML-VAE successfully locates the mismatch between text and speech without requiring human annotations for model training.
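To make the alternating procedure concrete, below is a minimal sketch of one training step in PyTorch. It assumes hypothetical helpers not defined in the abstract: `ml_vae.infer` (producing per-frame scores for the discrete latent variables), `ml_fsa_decode` (a best-path search over the mismatch localization FSA), and `elbo_loss` (the variational objective); these names are placeholders for illustration, not the paper's actual API.

```python
import torch

def train_step(ml_vae, optimizer, speech, text, ml_fsa_decode, elbo_loss):
    """One alternating update: FSA hard assignment, then a gradient step."""
    # Step 1: with network parameters fixed, estimate hard assignments of
    # the discrete latent variables by searching over the ML-FSA.
    with torch.no_grad():
        scores = ml_vae.infer(speech, text)        # per-frame latent scores
        assignments = ml_fsa_decode(scores, text)  # best path through ML-FSA

    # Step 2: with the discrete assignments fixed, update the neural-network
    # parameters by gradient descent on the variational objective.
    optimizer.zero_grad()
    loss = elbo_loss(ml_vae, speech, text, assignments)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Alternating hard-assignment search with gradient updates in this way sidesteps backpropagating through the discrete latent variables, which is what makes the model hard to train end-to-end.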