This paper presents the details of the SRIB-LEAP submission to the ConferencingSpeech challenge 2021. The challenge involved the task of multi-channel speech enhancement to improve the quality of far-field speech captured by microphone arrays in a video conferencing room. We propose a two-stage method consisting of a beamformer followed by single-channel enhancement. For the beamformer, we incorporate a self-attention mechanism as an inter-channel processing layer in the filter-and-sum network (FaSNet), an end-to-end time-domain beamforming system. The single-channel speech enhancement is performed in the log-spectral domain using a convolutional neural network (CNN) and long short-term memory (LSTM) based architecture. On objective quality evaluation, the proposed approach improved the perceptual evaluation of speech quality (PESQ) score by 0.5 over the noisy data. On subjective quality evaluation, it improved the mean opinion score (MOS) by an absolute 0.9 over the noisy audio.
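As a rough illustration of what an inter-channel self-attention layer does, the following NumPy sketch applies scaled dot-product attention across microphone channels for a single frame. It is not the submitted system: the abstract does not specify the layer's dimensions or how it is wired into FaSNet, so the channel count, the feature dimension, the random projection matrices, and the function name `inter_channel_self_attention` are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inter_channel_self_attention(feats, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the channel axis.

    feats: array of shape (channels, dim), one feature vector per
           microphone channel for a single frame (hypothetical layout).
    Returns the attended features (channels, dim) and the attention
    weights (channels, channels).
    """
    q = feats @ w_q
    k = feats @ w_k
    v = feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1 over channels
    return weights @ v, weights


# Illustrative sizes only; the abstract does not state them.
rng = np.random.default_rng(0)
num_channels, dim = 8, 16
feats = rng.standard_normal((num_channels, dim))
w_q, w_k, w_v = (
    rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
)

attended, weights = inter_channel_self_attention(feats, w_q, w_k, w_v)
```

Each channel's output is a weighted mix of all channels' value vectors, and the weights depend only on the channel features and not on their order, so reordering the microphones simply reorders the output. In a trained system the projection matrices would be learned rather than random.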