How to boost speech pre-training with textual data is an unsolved problem, because speech and text are disparate modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) that explicitly aligns speech and text pre-training through a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, both of which can be trained with a small amount of paired speech-text data. Using the trained tokenizers, we convert unlabeled speech and text data into phoneme-unit or hidden-unit tokens. The pre-training objective is designed to unify speech and text into the same discrete semantic space with a shared Transformer network. Leveraging only 10K text sentences, our SpeechLM achieves a 16\% relative WER reduction over the best base-model performance (from 6.8 to 5.7) on the public LibriSpeech ASR benchmark. Moreover, with fewer parameters, SpeechLM even outperforms previous SOTA models on the CoVoST-2 speech translation tasks. We also evaluate SpeechLM on various spoken language processing tasks under the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Our code and models are available at https://aka.ms/SpeechLM.
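As a rough illustration of the unified-representation idea (not the authors' implementation), the sketch below shows how both modalities could be mapped into one shared discrete unit vocabulary and then processed by a single Transformer. The tokenizer functions, vocabulary size, and model dimensions here are hypothetical placeholders; in the paper the tokenizers are trained on a small amount of paired speech-text data.

```python
# Conceptual sketch only: speech and text are first mapped to the same
# discrete unit inventory (phoneme units or hidden units), then consumed
# by one shared Transformer encoder.
import torch
import torch.nn as nn

UNIT_VOCAB_SIZE = 512   # assumed size of the shared discrete unit inventory
D_MODEL = 256           # assumed model dimension

class UnifiedEncoder(nn.Module):
    """One Transformer over the shared discrete semantic space."""
    def __init__(self):
        super().__init__()
        self.unit_embedding = nn.Embedding(UNIT_VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, unit_ids):  # (batch, seq) of discrete unit ids
        return self.encoder(self.unit_embedding(unit_ids))

def text_to_units(sentence: str) -> torch.Tensor:
    """Hypothetical phoneme-unit tokenizer: text -> discrete unit ids."""
    # Placeholder: a real tokenizer would map graphemes to phoneme units.
    return torch.randint(0, UNIT_VOCAB_SIZE, (1, len(sentence.split()) * 4))

def speech_to_units(waveform: torch.Tensor) -> torch.Tensor:
    """Hypothetical hidden-unit tokenizer: speech -> discrete unit ids."""
    # Placeholder: a real tokenizer would quantize acoustic features.
    return torch.randint(0, UNIT_VOCAB_SIZE, (1, waveform.shape[-1] // 320))

model = UnifiedEncoder()
text_repr = model(text_to_units("speech and text share one space"))
speech_repr = model(speech_to_units(torch.randn(1, 16000)))  # ~1 s at 16 kHz
print(text_repr.shape, speech_repr.shape)
```

Because both token streams index the same embedding table and pass through the same encoder, the two modalities are forced into a common discrete semantic space, which is the core intuition behind SpeechLM's pre-training objective.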