Room impulse responses (RIRs) characterize the acoustic properties of indoor environments and play a crucial role in applications such as speech enhancement, speech recognition, and audio rendering in augmented reality (AR) and virtual reality (VR). Existing blind estimation methods struggle to achieve practical accuracy. To overcome this challenge, we propose the dynamic audio-room acoustic synthesis (DARAS) model, a novel deep learning framework designed for blind RIR estimation from monaural reverberant speech signals. First, a dedicated deep audio encoder extracts nonlinear latent-space features. Second, the Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module, built on the efficient Mamba state space model (SSM), estimates key room acoustic parameters and features. Third, a hybrid-path cross-attention feature fusion module integrates the audio features with the room acoustic features. Finally, our proposed dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experimental results, including a MUSHRA-based subjective listening study, demonstrate that DARAS substantially outperforms existing baseline models, providing a robust and effective solution for practical blind RIR estimation in real-world acoustic environments.
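To make the underlying signal model concrete, the following minimal sketch shows how a reverberant signal arises from an RIR and what separating early reflections from late reverberation means. It is not the DARAS implementation: the DAT decoder is described as choosing the segmentation adaptively, whereas this sketch uses a fixed 50 ms boundary after the direct path, a common rule of thumb that is assumed here purely for illustration. The toy RIR (exponentially decaying noise) and the function names `synth_toy_rir`, `split_early_late`, and `reverberate` are likewise hypothetical.

```python
import numpy as np


def synth_toy_rir(fs=16000, rt60=0.5, length_s=1.0, seed=0):
    """Toy RIR (assumption): a unit direct-path impulse followed by
    Gaussian noise whose amplitude decays by 60 dB over rt60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(fs * length_s)
    t = np.arange(n) / fs
    decay = 10.0 ** (-3.0 * t / rt60)  # -60 dB in amplitude at t = rt60
    rir = 0.1 * rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct path
    return rir


def split_early_late(rir, fs, boundary_ms=50.0):
    """Split an RIR at a fixed offset after its largest peak.

    The fixed boundary is an illustrative assumption; the paper's decoder
    is described as determining the segmentation adaptively.
    """
    onset = int(np.argmax(np.abs(rir)))  # index of the direct path
    cut = min(len(rir), onset + int(round(fs * boundary_ms / 1000.0)))
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    early[:cut] = rir[:cut]
    late[cut:] = rir[cut:]
    return early, late


def reverberate(dry, rir):
    """Forward model: reverberant signal = dry signal convolved with the RIR."""
    return np.convolve(dry, rir, mode="full")


fs = 16000
rir = synth_toy_rir(fs=fs)
early, late = split_early_late(rir, fs)

# White noise stands in for a dry speech recording in this sketch.
dry = np.random.default_rng(1).standard_normal(fs // 4)
wet = reverberate(dry, rir)

# Convolution is linear, so the early and late contributions add up
# to the full reverberant signal.
wet_from_parts = reverberate(dry, early) + reverberate(dry, late)
print(np.allclose(wet, wet_from_parts))
```

Blind RIR estimation is the inverse of `reverberate`: only the reverberant signal is observed, and both the dry signal and the RIR are unknown, which is why the problem is hard.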