Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights represented using weighted finite-state transducers, which we call differentiable allophone graphs. By training multilingually, we build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language. Linguists can use these phone-based systems with learned allophone graphs to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages. We demonstrate these benefits with a system trained on seven diverse languages.
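To make the central idea concrete, the sketch below shows one simplified way phone-level predictions could be tied to phoneme-level supervision through a weighted phone-to-phoneme mapping. It is not the weighted finite-state transducer implementation described above: it assumes a dense matrix stands in for the allophone graph, that each phone's weights are softmax-normalized over the phonemes it may realize, and that phoneme posteriors are a weighted sum of phone posteriors. All names (`phone_to_phoneme_posteriors`, `mapping_mask`, `mapping_logits`) and the toy inventory are hypothetical. Only the forward pass is shown; in training, the mapping weights would be learned by backpropagating a phoneme-level loss through this step.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive probability 0."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def phone_to_phoneme_posteriors(phone_logits, mapping_mask, mapping_logits):
    """Map frame-level phone scores to phoneme posteriors (assumed formulation).

    phone_logits:   (T, P) unnormalized phone scores per frame.
    mapping_mask:   (P, Q) boolean, True where a phone may realize a phoneme.
    mapping_logits: (P, Q) learnable scores for each allowed mapping.
    Returns (phoneme_probs of shape (T, Q), weights of shape (P, Q)).
    """
    # Every phone needs at least one allowed phoneme, otherwise its row is undefined.
    assert mapping_mask.any(axis=1).all(), "each phone must map to some phoneme"

    phone_probs = softmax(phone_logits, axis=-1)               # (T, P)
    # Disallowed mappings get -inf so they end up with exactly zero weight.
    masked = np.where(mapping_mask, mapping_logits, -np.inf)   # (P, Q)
    weights = softmax(masked, axis=-1)                         # rows sum to 1
    phoneme_probs = phone_probs @ weights                      # (T, Q)
    return phoneme_probs, weights


# Toy inventory (hypothetical): phones [t, tʰ, d], phonemes [/t/, /d/].
# Phone [t] is allowed to realize either phoneme; the other two are unambiguous.
mask = np.array([[True, True],
                 [True, False],
                 [False, True]])
rng = np.random.default_rng(0)
frame_logits = rng.normal(size=(4, 3))   # 4 frames, 3 phones
probs, weights = phone_to_phoneme_posteriors(frame_logits, mask, np.zeros((3, 2)))
print(probs.shape)
print(weights)
```

Because each row of `weights` is a probability distribution over phonemes, the learned values can be read directly as a probabilistic phone-to-phoneme mapping for a language, which is the kind of interpretability the abstract refers to.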