End-to-end automatic speech recognition directly maps input speech to characters. However, the mapping can be problematic when several different pronunciations should be mapped into one character or when one pronunciation is shared among many different characters. Japanese ASR suffers the most from such many-to-one and one-to-many mapping problems due to Japanese kanji characters. To alleviate these problems, we introduce explicit interaction between characters and syllables using self-conditioned connectionist temporal classification (CTC), in which the upper layers are "self-conditioned" on the intermediate predictions from the lower layers. The proposed method utilizes character-level and syllable-level intermediate predictions as conditioning features to deal with the mutual dependency between characters and syllables. Experimental results on the Corpus of Spontaneous Japanese show that the proposed method outperformed the conventional multi-task and self-conditioned CTC methods.
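The conditioning mechanism described above can be sketched as follows: frame-level posteriors from intermediate character and syllable CTC heads are projected back to the encoder's hidden dimension and added to the hidden states, so that upper layers observe both prediction streams. This is a minimal NumPy sketch, not the authors' implementation; all weight matrices, dimensions, and the function name `self_condition` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                    # hidden size (assumed for illustration)
V_CHAR, V_SYL = 6, 5     # toy character / syllable vocabulary sizes (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Random weights stand in for trained parameters.
W_char = rng.normal(size=(D, V_CHAR))   # intermediate character CTC head
W_syl = rng.normal(size=(D, V_SYL))     # intermediate syllable CTC head
P_char = rng.normal(size=(V_CHAR, D))   # projects character posteriors back to D
P_syl = rng.normal(size=(V_SYL, D))     # projects syllable posteriors back to D

def self_condition(h):
    """Condition hidden states on both intermediate prediction streams:
    add projected character and syllable posteriors to each frame."""
    z_char = softmax(h @ W_char)        # frame-level character posteriors
    z_syl = softmax(h @ W_syl)          # frame-level syllable posteriors
    return h + z_char @ P_char + z_syl @ P_syl

T = 4                                   # number of frames
h = rng.normal(size=(T, D))             # lower-layer encoder output
h_cond = self_condition(h)              # input to the upper encoder layers
print(h_cond.shape)
```

In training, CTC losses would be attached both to these intermediate heads and to the final output layer; the sketch only shows the forward conditioning step that lets characters and syllables interact.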