End-to-end multilingual speech recognition trains a single model on a composite speech corpus covering many languages, so that one neural network handles transcription for all of them. Because each language in the training data has different characteristics, the shared network may struggle to optimize for all languages simultaneously. In this paper we propose a novel multilingual architecture that targets the core operation in neural networks: the linear transformation. The key idea is to assign fast weight matrices to each language by decomposing each weight matrix into a shared component and a language-dependent component. The latter is then factorized into vectors under a rank-1 assumption to reduce the number of parameters per language. This efficient factorization scheme proves effective in two multilingual settings with $7$ and $27$ languages, reducing the word error rates by $26\%$ and $27\%$ relative for the two popular architectures LSTM and Transformer, respectively.
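As a minimal sketch of the decomposition described above (the symbols $W_l$, $W_s$, $u_l$, and $v_l$ are our own notation, not fixed by the text), the weight matrix of a linear layer for language $l$ could take the form
\[
W_l \;=\; W_s \;+\; u_l v_l^{\top}, \qquad u_l \in \mathbb{R}^{d_{\text{out}}}, \quad v_l \in \mathbb{R}^{d_{\text{in}}},
\]
so the shared component $W_s$ is trained on all languages, while the rank-1 outer product $u_l v_l^{\top}$ adds only $d_{\text{out}} + d_{\text{in}}$ extra parameters per language and per layer.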