All-neural, end-to-end (E2E) ASR systems have attracted rapid interest from the speech recognition community. Such systems convert speech input to text units using a single trainable neural network model. E2E models require large amounts of paired speech-text data, which are expensive to obtain, and the amount of data available varies across languages and dialects. It is critical to make use of all of these data so that both low-resource and high-resource languages can be improved. Likewise, when we want to deploy an ASR system for a new application domain, the amount of domain-specific training data is often very limited, and being able to leverage data from existing domains is important for ASR accuracy in the new domain. In this paper, we treat all of these aspects as categorical information in an ASR system and propose a simple yet effective way to integrate categorical features into an E2E model. We perform a detailed analysis of various training strategies and find that a joint model that includes categorical features can be more accurate than multiple independently trained models.
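As a rough illustration only (the abstract does not specify the exact integration mechanism), one common way to inject a categorical feature such as a language, dialect, or domain ID into an E2E encoder is to embed the ID and concatenate the embedding to every acoustic frame. The sketch below assumes PyTorch; all names (CategoricalEncoder, num_categories, feat_dim, etc.) are illustrative, not the paper's.

```python
# Minimal sketch, not the paper's exact method: embed a categorical ID
# (language / dialect / domain) and append it to each acoustic frame
# before a stand-in encoder.
import torch
import torch.nn as nn


class CategoricalEncoder(nn.Module):
    def __init__(self, feat_dim=80, num_categories=4, cat_dim=16, hidden=256):
        super().__init__()
        # One learned vector per category.
        self.cat_embed = nn.Embedding(num_categories, cat_dim)
        # Simple LSTM standing in for the E2E model's acoustic encoder.
        self.encoder = nn.LSTM(feat_dim + cat_dim, hidden, batch_first=True)

    def forward(self, feats, cat_ids):
        # feats:   (batch, time, feat_dim) acoustic features
        # cat_ids: (batch,) integer category labels
        emb = self.cat_embed(cat_ids)                        # (batch, cat_dim)
        emb = emb.unsqueeze(1).expand(-1, feats.size(1), -1)  # broadcast over time
        x = torch.cat([feats, emb], dim=-1)                   # append to each frame
        out, _ = self.encoder(x)
        return out


# Toy usage: 2 utterances, 100 frames of 80-dim features, categories 0 and 3.
model = CategoricalEncoder()
feats = torch.randn(2, 100, 80)
cat_ids = torch.tensor([0, 3])
print(model(feats, cat_ids).shape)  # torch.Size([2, 100, 256])
```

Training one such joint model over all categories, rather than one model per language or domain, is the comparison the abstract's analysis refers to.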