Modern neural-network-based speech processing systems are typically required to be robust against reverberation, so training them requires a large amount of reverberant data. During training, an on-the-fly simulation pipeline is now preferred because it lets the model train on an effectively unlimited number of data samples without pre-generating them and saving them to hard disk. A room impulse response (RIR) simulation method therefore needs not only to generate realistic artificial RIR filters, but also to generate them quickly enough to accelerate training. Existing RIR simulation tools have proven effective across a wide range of speech processing tasks and neural network architectures, but their use in on-the-fly simulation pipelines remains questionable because of their computational complexity or the quality of the RIR filters they generate. In this paper, we propose FRAM-RIR, a fast random approximation of the widely used image-source method (ISM), to efficiently generate realistic multi-channel RIR filters. FRAM-RIR bypasses the explicit calculation of sound propagation paths in ISM-based algorithms by randomly sampling the location and number of reflections of each virtual sound source based on several heuristic assumptions, while still maintaining accurate direction-of-arrival (DOA) information for all sound sources. Visualizations of oracle beampatterns and directional features show that FRAM-RIR can generate more realistic RIR filters than existing widely used ISM-based tools. Experimental results on multi-channel noisy speech separation and dereverberation tasks with a wide range of neural network architectures show that models trained with FRAM-RIR can achieve on-par or better performance on real RIRs than models trained with other RIR simulation tools, with a significantly accelerated training procedure. A Python implementation of FRAM-RIR is released.