Multilingual automatic speech recognition (ASR) systems mostly benefit low resource languages but suffer degradation in performance across several languages relative to their monolingual counterparts. Limited studies have focused on understanding the languages behaviour in the multilingual speech recognition setups. In this paper, a novel data-driven approach is proposed to investigate the cross-lingual acoustic-phonetic similarities. This technique measures the similarities between posterior distributions from various monolingual acoustic models against a target speech signal. Deep neural networks are trained as mapping networks to transform the distributions from different acoustic models into a directly comparable form. The analysis observes that the languages closeness can not be truly estimated by the volume of overlapping phonemes set. Entropy analysis of the proposed mapping networks exhibits that a language with lesser overlap can be more amenable to cross-lingual transfer, and hence more beneficial in the multilingual setup. Finally, the proposed posterior transformation approach is leveraged to fuse monolingual models for a target language. A relative improvement of ~8% over monolingual counterpart is achieved.
翻译:多语言自动语音识别(ASR)系统主要有利于低资源语言,但相对于单一语言语言而言,多种语言语言的功能下降。有限研究侧重于理解多语言语音识别设置中的语言行为。在本文件中,提议采用新的数据驱动方法,调查跨语言声传话相似性。这一技术测量了各种单语声学模型的后方分布与目标语音信号之间的相似性。深神经网络作为绘图网络接受了培训,将不同声传模型的分布转化为直接可比的形式。分析发现,语言的接近性无法真正用重叠的电话设置的数量来估计。对拟议绘图网络的复读分析显示,相对较少重叠的语言更便于跨语言传输,因此在多语种设置中也更有利。最后,拟议的后方语转换方法被用来结合目标语言的单语模式。实现了相对于单语对应语言的~8%的相对改善。