Phoneme recognition remains a largely unsolved problem in NLP, especially for low-resource languages such as Urdu. Systems that extract phonemes from speech audio require hand-labeled phonetic transcriptions, which means expert linguists must annotate the speech data with its phonetic representation, a task that is both expensive and tedious. In this paper, we propose STRATA, a framework for supervised phoneme recognition that overcomes data scarcity in low-resource languages by using a seq2seq neural architecture integrated with transfer learning, an attention mechanism, and data augmentation. STRATA employs transfer learning to cut the network loss in half. It uses an attention mechanism to detect word boundaries and frame alignments, which further reduces the network loss by 4% and identifies word boundaries with 92.2% accuracy. STRATA also applies various data augmentation techniques, which further reduce the loss by 1.5% and make the model more robust to new signals in terms of both generalization and accuracy. STRATA achieves a Phoneme Error Rate of 16.5% and improves upon the state of the art by 1.1% on the TIMIT dataset (English) and by 11.5% on the CSaLT dataset (Urdu).
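For readers unfamiliar with the metric, Phoneme Error Rate is conventionally computed as the edit distance between the predicted and reference phoneme sequences, divided by the total number of reference phonemes. The sketch below illustrates that conventional definition only; the function names and the toy phoneme sequences are hypothetical, and the exact scoring protocol used for the reported numbers (for example, any phoneme-set folding) is not specified here and is not assumed.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a spurious phoneme
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: Sequence[Sequence[str]],
                       hyps: Sequence[Sequence[str]]) -> float:
    """Corpus-level PER: total edit operations / total reference phonemes."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must contain the same number of utterances")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_ref


# Toy example (made-up sequences, not drawn from TIMIT or CSaLT).
refs = [["k", "ae", "t"], ["d", "ao", "g"]]
hyps = [["k", "ah", "t"], ["d", "ao"]]
per = phoneme_error_rate(refs, hyps)  # one substitution + one deletion over 6 phonemes
```

Summing errors over the whole corpus before dividing, rather than averaging per-utterance rates, weights long utterances in proportion to their length; this is a common choice, though not necessarily the one used for the reported results.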