Fully exploiting ad-hoc microphone networks for distant speech recognition remains an open problem. Empirical evidence shows that selecting the best microphone leads to significant improvements in recognition without any additional front-end processing. Current channel selection techniques rely on signal-based, decoder-based, or posterior-based features. Signal-based features are inexpensive to compute but do not always correlate with recognition performance. Decoder-based and posterior-based features correlate better but require substantial computational resources. In this work, we tackle the channel selection problem by proposing MicRank, a learning-to-rank framework in which a neural network is trained to rank the available channels directly from the recognition performance on the training set. The proposed approach is agnostic to the array geometry and to the type of recognition back-end. We investigate different learning-to-rank strategies using a purpose-built synthetic dataset and the CHiME-6 data. Results show that the proposed approach considerably improves over previous selection techniques, reaching performance comparable to, and in some instances better than, oracle signal-based measures.
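To make the learning-to-rank idea concrete, the following is a minimal sketch of one common pairwise formulation (a RankNet-style logistic loss), assuming each channel of an utterance has a scalar score from the network and a word error rate measured on the training set. It is an illustration under those assumptions, not the authors' implementation; the function names `pairwise_rank_loss` and `select_channel` are hypothetical.

```python
import math

def pairwise_rank_loss(scores, wers):
    """RankNet-style pairwise logistic loss over the channels of one utterance.

    scores: predicted score per channel (higher = predicted better).
    wers:   word error rate per channel on the training set (lower = better).
    Both are assumed inputs for illustration only.
    """
    loss, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if wers[i] < wers[j]:  # channel i should outrank channel j
                # log(1 + exp(-(s_i - s_j))): small when s_i is well above s_j
                loss += math.log1p(math.exp(-(scores[i] - scores[j])))
                n_pairs += 1
    # Utterances where all channels tie contribute no ranking signal.
    return loss / n_pairs if n_pairs else 0.0

def select_channel(scores):
    """At inference time, pick the channel with the highest predicted score."""
    return max(range(len(scores)), key=lambda c: scores[c])
```

Because the loss depends only on per-channel scores and recognition errors, a sketch like this makes no reference to array geometry or to the recognition back-end, which is consistent with the agnosticism claimed above.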