Language models (LMs) for text data have been studied extensively for their usefulness in language generation and other downstream tasks. However, language modelling purely in the speech domain is still relatively unexplored, and traditional speech LMs often depend on auxiliary text LMs to learn distributional aspects of the language. For English, these LMs treat words as atomic units, which presents inherent challenges to language modelling in the speech domain. In this paper, we propose a novel LSTM-based generative speech LM that is inspired by the CBOW model and built on linguistic units including syllables and phonemes. These units offer better acoustic consistency across utterances in the dataset than single mel-spectrogram frames or whole words. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features. Through our experiments, we also highlight some well-known but poorly documented challenges in training generative speech LMs, including the mismatch between the supervised learning objective with which these models are trained, such as Mean Squared Error (MSE), and the true objective, which is speech quality. Our experiments provide an early indication that while validation loss and Mel Cepstral Distortion (MCD) are not strongly correlated with generated speech quality, traditional text language modelling metrics such as perplexity and next-token prediction accuracy might be.