End-to-end models are rapidly replacing conventional hybrid models in automatic speech recognition. The Transformer, a self-attention-based sequence-to-sequence model widely used in machine translation, has shown promising results when applied to automatic speech recognition. This paper explores different ways of incorporating speaker information at the encoder input while training a Transformer-based model to improve its speech recognition performance. We present speaker information in the form of speaker embeddings, one for each speaker. We experiment with two types of speaker embeddings: x-vectors and the novel s-vectors proposed in our previous work. We report results on two datasets: (a) the NPTEL lecture database and (b) the Librispeech 500-hour split. NPTEL is an open-source e-learning portal providing lectures from top Indian universities. Our approach of integrating speaker embeddings into the model yields improvements in word error rate over the baseline.
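To make the idea of conditioning the encoder input on a speaker embedding concrete, here is a minimal PyTorch sketch. It shows one plausible fusion strategy, concatenating a per-utterance embedding (such as an x-vector or s-vector) to every acoustic frame and projecting back to the encoder's model dimension; the module name, dimensions, and concatenation choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeakerConditionedEncoderInput(nn.Module):
    """Hypothetical front-end: concatenate a per-utterance speaker
    embedding (e.g. x-vector or s-vector) to every acoustic frame,
    then project to the transformer encoder's model dimension.
    This is a sketch of one fusion option, not the paper's method."""

    def __init__(self, feat_dim=80, spk_dim=512, d_model=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim + spk_dim, d_model)

    def forward(self, feats, spk_emb):
        # feats:   (batch, time, feat_dim)  filterbank features
        # spk_emb: (batch, spk_dim)         one embedding per utterance
        t = feats.size(1)
        spk = spk_emb.unsqueeze(1).expand(-1, t, -1)  # repeat along time
        return self.proj(torch.cat([feats, spk], dim=-1))

# Usage: fuse a 512-dim speaker embedding with 100 frames of 80-dim features.
frontend = SpeakerConditionedEncoderInput()
feats = torch.randn(4, 100, 80)    # batch of acoustic feature sequences
xvec = torch.randn(4, 512)         # one speaker embedding per utterance
enc_in = frontend(feats, xvec)     # (4, 100, 256), fed to the encoder
```

Other fusion strategies, such as adding a projected embedding to the features instead of concatenating, fit the same interface.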