The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence length. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We present optimizations to increase the efficiency and parallelism of the sample-wise method. In a set of thorough benchmarks, we show that our sample-wise method significantly reduces memory usage and performs at competitive speed when compared to the default batched computation. As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024 and an audio length of 40 seconds using only 6 GB of memory.
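To make the idea of sample-wise loss and gradient computation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `joiner` module that broadcasts per-frame encoder states (1, T, 1, H) and per-token predictor states (1, 1, U+1, H) into joint logits (1, T, U+1, V), and it uses `torchaudio.functional.rnnt_loss` as the per-sample transducer loss. The key point is that only one sample's joint tensor is alive at a time, so peak memory no longer grows with the batch size.

```python
import torch
import torchaudio


def samplewise_transducer_loss(joiner, enc_out, pred_out, targets,
                               enc_lens, target_lens, blank):
    """Sketch: compute the transducer loss and gradients one sample at a time.

    enc_out:  (B, T_max, H) encoder outputs (requires grad)
    pred_out: (B, U_max + 1, H) predictor outputs (requires grad)
    targets:  (B, U_max) label indices
    """
    batch_size = enc_out.size(0)
    enc_grad = torch.zeros_like(enc_out)
    pred_grad = torch.zeros_like(pred_out)
    total_loss = 0.0

    for b in range(batch_size):
        t, u = int(enc_lens[b]), int(target_lens[b])
        # Detach the slices so backward below stops at these leaves instead of
        # traversing the encoder/predictor graph once per sample.
        enc_b = enc_out[b:b + 1, :t].detach().requires_grad_(True)
        pred_b = pred_out[b:b + 1, :u + 1].detach().requires_grad_(True)

        # Joint logits for this sample only: (1, t, u + 1, V).
        joint = joiner(enc_b.unsqueeze(2), pred_b.unsqueeze(1))

        loss = torchaudio.functional.rnnt_loss(
            joint, targets[b:b + 1, :u].int(),
            enc_lens[b:b + 1].int(), target_lens[b:b + 1].int(),
            blank=blank, reduction="sum")

        # Backward through the per-sample joint graph; its activations can be
        # freed before the next sample. Scale by 1/B for a batch-mean loss.
        (loss / batch_size).backward()

        enc_grad[b, :t] = enc_b.grad[0]
        pred_grad[b, :u + 1] = pred_b.grad[0]
        total_loss += loss.item()

    # Propagate the accumulated gradients through encoder and predictor once.
    torch.autograd.backward([enc_out, pred_out], [enc_grad, pred_grad])
    return total_loss / batch_size
```

The joiner's parameter gradients accumulate inside the loop, while the encoder and predictor receive a single backward pass driven by the accumulated `enc_grad` and `pred_grad`, so the per-sample computation changes memory behavior without changing the resulting gradients.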