Blind acoustic parameter estimation is the task of inferring the acoustic properties of an environment from recordings of unknown sound sources. Recent works in this area have used deep neural networks trained either partially or exclusively on simulated data, owing to the limited availability of real annotated measurements. In this paper, we study whether a model trained purely on data from a fast image-source room impulse response simulator can generalize to real data. We present an ablation study on carefully crafted simulated training sets that account for different levels of realism in source, receiver and wall responses. The extent of realism is controlled by the sampling of wall absorption coefficients and by applying measured directivity patterns to microphones and sources. A state-of-the-art model trained on these datasets is evaluated on the task of jointly estimating the room's volume, total surface area, and octave-band reverberation times from multiple multichannel speech recordings. Results reveal that every added layer of simulation realism at train time significantly improves the estimation of all quantities on real signals.
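To make the link between the estimated quantities concrete, the following is a minimal sketch, not the simulator or the sampling scheme used in this work. It assumes a shoebox room, a single mean absorption coefficient per octave band drawn from a hypothetical uniform range, and Sabine's formula as a rough stand-in for reverberation times that would otherwise be derived from simulated impulse responses. All function names, band centres and ranges are illustrative assumptions.

```python
import random

# Octave-band centre frequencies in Hz (an assumed, commonly used set).
OCTAVE_BANDS_HZ = (125, 250, 500, 1000, 2000, 4000)


def shoebox_geometry(length, width, height):
    """Return (volume in m^3, total surface area in m^2) of a shoebox room."""
    volume = length * width * height
    surface = 2.0 * (length * width + length * height + width * height)
    return volume, surface


def sample_absorption(rng, low=0.05, high=0.6):
    """Draw one mean absorption coefficient per octave band.

    The uniform range [low, high] is a hypothetical choice for illustration.
    """
    return [rng.uniform(low, high) for _ in OCTAVE_BANDS_HZ]


def sabine_rt60(volume, surface, absorption):
    """Sabine's formula per band: RT60 = 0.161 * V / (S * alpha)."""
    return [0.161 * volume / (surface * a) for a in absorption]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
volume, surface = shoebox_geometry(6.0, 4.0, 3.0)
alphas = sample_absorption(rng)
rt60 = sabine_rt60(volume, surface, alphas)

for band, alpha, t in zip(OCTAVE_BANDS_HZ, alphas, rt60):
    print(f"{band:>5} Hz  alpha={alpha:.2f}  RT60={t:.2f} s")
```

In this simplified view, the volume, surface area and per-band absorption jointly determine the reverberation times, which is why the three targets are naturally estimated together. Frequency-dependent wall responses and source or receiver directivity, which the study varies, are not captured by this formula.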