Low-latency speech-based human-machine communication has become increasingly necessary as speech technology has advanced rapidly over the last decade. One of the primary drivers of this advancement is self-supervised learning. Most self-supervised learning algorithms are designed under a full-utterance assumption, and compromises have to be made when only partial utterances are available, as is common in streaming applications. In this work, we propose a chunk-based self-supervised learning (Chunk SSL) algorithm as a unified solution for both streaming and offline speech pre-training. Chunk SSL is optimized with a masked prediction loss: an acoustic encoder is encouraged to restore the indices of masked speech frames with help from unmasked frames in the same chunk and in preceding chunks. A copy-and-append data augmentation approach is proposed to conduct efficient chunk-based pre-training. Chunk SSL utilizes a finite scalar quantization (FSQ) module to discretize input speech features, and our study shows that a high-resolution FSQ codebook, i.e., a codebook with a vocabulary size of up to a few million, is beneficial for transferring knowledge from the pre-training task to downstream tasks. A group masked prediction loss is employed during pre-training to alleviate the high memory and computation cost introduced by the large codebook. The proposed approach is evaluated on two speech-to-text tasks, i.e., speech recognition and speech translation. Experimental results on the \textsc{Librispeech} and \textsc{Must-C} datasets show that the proposed method achieves very competitive results on speech-to-text tasks in both streaming and offline modes.
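The finite scalar quantization step mentioned above can be illustrated with a minimal sketch: each latent dimension is squashed to a bounded interval and rounded to one of a small number of levels, so the implied codebook size is the product of the per-dimension level counts. The level configuration (`[8] * 7`, giving roughly two million codes) and the `tanh` bounding are illustrative assumptions, not necessarily the paper's exact setup.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize a latent vector with finite scalar quantization (sketch).

    z      : 1-D latent vector, one entry per codebook dimension
    levels : number of quantization levels per dimension (assumed config)
    Returns the per-dimension level indices and a single combined
    codebook index (mixed-radix combination of the per-dimension indices).
    """
    z = np.tanh(np.asarray(z, dtype=float))           # bound to (-1, 1)
    levels = np.asarray(levels)
    # Map (-1, 1) onto [0, L-1] and round to the nearest integer level.
    idx_per_dim = np.round((z + 1.0) / 2.0 * (levels - 1)).astype(int)
    # Combine per-dimension indices into one codebook index (mixed radix).
    index = 0
    for i, L in zip(idx_per_dim, levels):
        index = index * int(L) + int(i)
    return idx_per_dim, index

levels = [8] * 7  # implied vocabulary size: 8**7 = 2,097,152 codes
```

Predicting the per-dimension (or per-group) indices separately, rather than a single softmax over the full combined index, is one way a group-factored loss can avoid a multi-million-class output layer; this sketch only shows the quantization side.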