In this study, listeners of varied Indian nativities are asked to listen to and recognize TIMIT utterances spoken by American speakers. We collect three kinds of responses from each listener while they recognize an utterance: 1. Sentence difficulty ratings, 2. Speaker difficulty ratings, and 3. Transcription of the utterance. From these transcriptions, the word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences. The sentences selected in this study are categorized into three groups: Easy, Medium and Hard, based on the frequency of occurrence of the words in them. We observe that the sentence difficulty ratings, speaker difficulty ratings and the WERs increase from the easy to the hard category of sentences. We also compare the human speech recognition (HSR) performance with that of automatic speech recognition (ASR) under the following three combinations of acoustic model (AM) and language model (LM): ASR1) AM trained with recordings from speakers of Indian origin and LM built on TIMIT text, ASR2) AM trained with recordings from native American speakers and LM built on text from the LIBRI speech corpus, and ASR3) AM trained with recordings from native American speakers and LM built on LIBRI speech and TIMIT text. We observe that the HSR performance is similar to that of ASR1, whereas ASR3 achieves the best performance. Speaker nativity-wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize than those from a few other nativities.
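For reference, the WER metric mentioned above is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between the reference and the recognized transcription, normalized by the number of reference words. The sketch below is illustrative only; the abstract does not specify the exact tokenization or text normalization used in the study, and the function name `wer` and the lowercasing/whitespace-splitting choices are assumptions.

```python
# Minimal sketch of word error rate (WER) between a reference transcription
# and a listener/ASR hypothesis, via word-level edit distance (Levenshtein).
# Assumptions: lowercasing and whitespace tokenization; the paper's exact
# normalization may differ.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # (S + D + I) normalized by the number of reference words
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    # Example with a well-known TIMIT prompt as the reference
    print(wer("she had your dark suit in greasy wash water all year",
              "she had a dark suit in greasy wash water all year"))
```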