The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different durations, a batching strategy is required to handle variable-size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strive for a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these strategies on resource utilization and, more importantly, on network performance is not well documented. This paper systematically investigates the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size allows for reduced training time and GPU memory usage while achieving performance similar to random batching with a fixed batch size.
On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems
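As a concrete illustration of the strategies compared above, the sketch below shows one possible bucket batching scheme with a dynamic batch size: utterances are grouped by length so that zero-padding within a batch stays small, and each batch is capped by a total padded-duration budget rather than a fixed number of utterances. The utterance durations, number of buckets, and the `max_batch_seconds` budget are illustrative assumptions, not the exact configuration used in the paper.

```python
import random

# Illustrative assumption: utterance durations in seconds (not the paper's dataset).
durations = [random.uniform(1.0, 10.0) for _ in range(64)]


def bucket_batches(durations, num_buckets=4, max_batch_seconds=16.0, shuffle=True, seed=0):
    """Group utterances of similar length into buckets, then form batches whose
    padded duration (items * longest item) stays under a fixed budget.

    Returns a list of batches, each a list of (index, duration) tuples.
    """
    rng = random.Random(seed)
    # Sort indices by duration so each bucket spans a narrow length range,
    # which keeps the zero-padding inside a batch small.
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    bucket_size = (len(order) + num_buckets - 1) // num_buckets
    buckets = [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]

    batches = []
    for bucket in buckets:
        if shuffle:
            rng.shuffle(bucket)  # randomize utterance order within the bucket
        batch = []
        for idx in bucket:
            candidate = batch + [idx]
            # Padded batch cost = number of items * duration of the longest item.
            longest = max(durations[i] for i in candidate)
            if batch and len(candidate) * longest > max_batch_seconds:
                batches.append([(i, durations[i]) for i in batch])
                batch = [idx]
            else:
                batch = candidate
        if batch:
            batches.append([(i, durations[i]) for i in batch])
    if shuffle:
        rng.shuffle(batches)  # randomize batch order across the epoch
    return batches


if __name__ == "__main__":
    for b, batch in enumerate(bucket_batches(durations)):
        longest = max(d for _, d in batch)
        print(f"batch {b}: {len(batch)} utterances, padded total {len(batch) * longest:.1f} s")
```

With this budget-based grouping, the number of utterances per batch varies with their length, so the amount of signal processed per training step stays roughly constant, which is what enables the reduced training time and GPU memory usage reported for sorted and bucket batching with a dynamic batch size.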