We present automatic speech recognition (ASR) systems for Tamil and Kannada based on subword modeling to effectively handle the unlimited vocabulary arising from the highly agglutinative nature of these languages. We explore byte pair encoding (BPE), propose a variant of this algorithm named extended-BPE, and use the Morfessor tool to segment each word into subwords. We effectively incorporate maximum likelihood (ML) and Viterbi estimation techniques within the weighted finite-state transducer (WFST) framework in these algorithms to learn the subword dictionary from a large text corpus. Using the learnt subword dictionary, the words in the training transcriptions are segmented into subwords, and we train deep neural network ASR systems that recognize the subword sequence for any given test speech utterance. The output subword sequence is then post-processed using deterministic rules to obtain the final word sequence, so that the actual number of words that can be recognized is much larger. For Tamil ASR, we use 152 hours of data for training and 65 hours for testing, whereas for Kannada ASR, we use 275 hours for training and 72 hours for testing. Experimenting with different combinations of segmentation and estimation techniques, we find that the word error rate (WER) reduces drastically compared to the baseline word-level ASR, achieving a maximum absolute WER reduction of 6.24% and 6.63% for Tamil and Kannada, respectively.
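To make the segmentation step concrete, the following is a minimal sketch of standard BPE, which the abstract names as one of the explored techniques. It learns merge operations from a word frequency list and applies them to segment a word into subwords. This is a generic illustration, not the paper's extended-BPE variant or its ML/Viterbi/WFST estimation; the function names and the toy data are our own.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a list of training words."""
    # Represent each word as a tuple of characters, with corpus frequencies.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for seq, freq in vocab.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for seq, freq in vocab.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a (possibly unseen) word using the learnt merge operations."""
    seq = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq
```

In an ASR pipeline like the one described, non-final subwords are typically tagged with a marker (e.g. a trailing `+`), so the deterministic post-processing rule that rebuilds words from the recognized subword sequence reduces to joining marked units with their successors.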