We propose a system that transcribes the conversation in a typical meeting scenario captured by a set of initially unsynchronized microphone arrays at unknown positions. It consists of subsystems for signal synchronization (covering both sampling rate offset and sampling time offset estimation), diarization based on speaker and microphone array position estimation, multi-channel speech enhancement, and automatic speech recognition. The estimated diarization information is used to initialize a spatial mixture model, which in turn is used to estimate beamformer coefficients for source separation. Simulations show that synchronizing and combining multiple distributed microphone arrays improves speech recognition accuracy compared to a single compact microphone array. Furthermore, the proposed informed initialization of the spatial mixture model delivers a clear performance advantage over random initialization.
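To make the synchronization step more concrete, the following is a minimal sketch of how a sampling time offset between two recordings could be estimated with a plain cross-correlation. It is not taken from the paper and is not claimed to be the authors' method: the function name `estimate_time_offset`, the parameter `max_lag`, and the restriction to integer sample offsets with no sampling rate offset are all assumptions made for illustration.

```python
import numpy as np


def estimate_time_offset(ref, sig, max_lag):
    """Estimate the integer delay (in samples) of `sig` relative to `ref`.

    Hypothetical helper for illustration only. A positive result means
    `sig` lags behind `ref`; a negative result means it leads.
    Assumes max_lag < min(len(ref), len(sig)) and equal sampling rates.
    """
    ref = np.asarray(ref, dtype=float)
    sig = np.asarray(sig, dtype=float)

    # Full cross-correlation: output index i corresponds to lag i - (len(ref) - 1).
    xcorr = np.correlate(sig, ref, mode="full")
    zero_lag = len(ref) - 1

    # Restrict the search to the lags in [-max_lag, max_lag].
    window = xcorr[zero_lag - max_lag : zero_lag + max_lag + 1]
    lags = np.arange(-max_lag, max_lag + 1)
    return int(lags[np.argmax(window)])


# Usage: delay a white-noise signal by a known number of samples.
rng = np.random.default_rng(0)
reference = rng.standard_normal(4000)
true_delay = 37
delayed = np.concatenate([np.zeros(true_delay), reference])[: len(reference)]
estimated_delay = estimate_time_offset(reference, delayed, max_lag=100)
```

In a setting like the one described above, a sampling rate offset makes the delay drift over time, so a single global cross-correlation of this kind would at best serve as a coarse first estimate.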