The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different durations, a batching strategy is required to handle variable-size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strike a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these practices on resource utilization and, more importantly, network performance is not well documented. This paper is an empirical study of the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size allows for reduced training time and GPU memory usage while achieving similar performance compared to random batching with a fixed batch size.
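To make the batching strategies discussed above concrete, the following is a minimal sketch of bucket batching combined with a dynamic batch size, where the batch size is expressed as a total-duration budget rather than a fixed number of utterances. This is an illustrative example only, not the implementation used in the paper; the function name, the `bucket_width` heuristic, and the duration budget are hypothetical.

```python
import random


def bucket_batches(lengths, max_batch_seconds, bucket_width):
    """Group utterance indices into batches of similar length (illustrative sketch).

    lengths: utterance durations in seconds (hypothetical input).
    max_batch_seconds: dynamic batch size, given as a budget on the padded total duration.
    bucket_width: maximum length spread allowed within one batch.
    """
    # Sort indices by duration so each batch contains similarly long utterances,
    # which keeps the amount of zero-padding small.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])

    batches, current, current_max = [], [], 0.0
    for i in order:
        # Padded duration if utterance i were added to the current batch.
        padded_total = max(current_max, lengths[i]) * (len(current) + 1)
        same_bucket = not current or lengths[i] - lengths[current[0]] <= bucket_width
        if current and (padded_total > max_batch_seconds or not same_bucket):
            batches.append(current)
            current, current_max = [], 0.0
        current.append(i)
        current_max = max(current_max, lengths[i])
    if current:
        batches.append(current)

    # Shuffle the order of batches (not their contents) to retain some randomization
    # across epochs while keeping padding low within each batch.
    random.shuffle(batches)
    return batches
```

In this sketch, sorting trades data randomization for reduced padding, while the duration budget keeps the amount of audio per batch roughly constant regardless of utterance length, which is the mechanism behind the reduced training time and GPU memory usage reported above.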