In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder, which is applied to each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can exploit this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used with factorial generative models in early multi-talker speech recognition systems. In contrast to these early works, we replace the GMM acoustic model with a DNN, which provides greater modeling power and simplifies part of the inference. We demonstrate the advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS dataset.
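To make the joint-decoding idea concrete, below is a minimal sketch, not the authors' implementation, of Viterbi search over the factorial (product) state space of two speakers. It assumes the DNN emits joint log-posteriors over speaker-state pairs as an array of shape (T, N1, N2), and per-speaker HMM log-transition matrices log_A1 and log_A2; all names and shapes here are illustrative assumptions.

```python
import numpy as np

def joint_viterbi(log_post, log_A1, log_A2):
    """Viterbi over the factorial (product) state space of two speakers.

    log_post: (T, N1, N2) joint state log-posteriors from the DNN
    log_A1:   (N1, N1) log transition matrix of speaker 1's HMM
    log_A2:   (N2, N2) log transition matrix of speaker 2's HMM
    Returns two (T,) arrays: the best state sequence for each speaker.
    """
    T, N1, N2 = log_post.shape
    K = N1 * N2  # joint state k = i * N2 + j

    # Given the states, the two Markov chains evolve independently, so the
    # joint log-transition score is the sum of the per-speaker scores.
    log_A = (log_A1[:, None, :, None] + log_A2[None, :, None, :]).reshape(K, K)
    obs = log_post.reshape(T, K)

    delta = obs[0].copy()               # best score ending in each joint state
    back = np.zeros((T, K), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (K, K): previous -> next joint state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + obs[t]

    # Backtrace the single best joint path.
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path // N2, path % N2  # recover per-speaker state indices
```

In this factorial view, the DNN's joint posteriors take the place of the observation likelihoods that a factorial GMM would have to compute, which is the part of the inference the abstract describes as simplified; the search itself still runs over a product space that grows multiplicatively with the number of speakers.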