This study addresses single-channel automatic speech recognition (ASR) of a target speaker in overlapped speech. In the proposed method, the hidden representations of the acoustic model are modulated by auxiliary speaker information so that only the desired speaker is recognized. Affine transformation layers are inserted into the acoustic model network to integrate the speaker information with the acoustic features. This speaker conditioning allows the acoustic model to perform its computation in the context of the target speaker's auxiliary information. The method is general and can be applied to any acoustic model architecture; here, we apply it to a ResNet acoustic model. Experiments on the WSJ corpus show that the proposed speaker conditioning effectively fuses auxiliary speaker information with acoustic features for multi-speaker speech recognition, achieving relative word error rate (WER) reductions of 9% and 20% in the clean and overlapped speech scenarios, respectively, compared with the original ResNet acoustic model baseline.
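As a rough illustration of what an affine speaker-conditioning layer might look like, the sketch below scales and shifts each channel of a hidden representation using values derived from a speaker vector. The abstract does not say how the scale and shift are obtained, so the linear projections, the name `speaker_affine`, all parameter names, and the tensor shapes here are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def speaker_affine(hidden, speaker_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate hidden features with a speaker-dependent scale and shift.

    hidden:      (time, channels) activations from some acoustic-model layer
    speaker_emb: (emb_dim,) auxiliary vector describing the target speaker
    The linear projections below are an assumed parameterization.
    """
    gamma = speaker_emb @ w_gamma + b_gamma  # (channels,) per-channel scale
    beta = speaker_emb @ w_beta + b_beta     # (channels,) per-channel shift
    return gamma * hidden + beta             # broadcast over the time axis

rng = np.random.default_rng(0)
time_steps, channels, emb_dim = 5, 8, 4
hidden = rng.standard_normal((time_steps, channels))
speaker_emb = rng.standard_normal(emb_dim)

# Hypothetical parameters; in a trained model these would be learned.
w_gamma = 0.1 * rng.standard_normal((emb_dim, channels))
b_gamma = np.ones(channels)   # bias of one keeps the scale near identity
w_beta = 0.1 * rng.standard_normal((emb_dim, channels))
b_beta = np.zeros(channels)

conditioned = speaker_affine(hidden, speaker_emb, w_gamma, b_gamma, w_beta, b_beta)
print(conditioned.shape)
```

In this sketch, zero projection weights with a unit scale bias and zero shift bias leave the hidden features unchanged, so the layer reduces to the unconditioned model in that case. Because the output has the same shape as the input, a layer of this kind could in principle sit between the blocks of any architecture, which is consistent with the abstract's claim of generality.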