An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning

Automatic speech recognition (ASR) is the automated decoding and transcription of spoken language without human intervention. A typical ASR system extracts features from audio recordings or streams and runs one or more algorithms to map those features to the corresponding text. A great deal of research has been done in speech signal processing in recent years. Given adequate resources, both conventional ASR and emerging end-to-end (E2E) speech recognition have produced promising results. For low-resource languages like Bengali, however, the state of ASR lags behind, even though the language is spoken by over 500 million people worldwide, a reach that its low-resource status does not reflect. Despite the language's popularity, few diverse open-source datasets are available, which makes research on Bengali speech recognition difficult. This paper was written as part of the 'BUET CSE Fest DL Sprint' competition. Its purpose is to improve speech recognition performance for Bengali by applying an E2E speech recognition architecture within a transfer learning framework. The proposed method models the Bengali language effectively and achieves a 'Levenshtein Mean Distance' of 3.819 on a test set of 7,747 samples when trained on only 1,000 samples from the training set.
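
The 'Levenshtein Mean Distance' reported above can be illustrated with a minimal sketch. It assumes the metric is the character-level edit distance between each reference transcript and its predicted transcript, averaged over all samples; the abstract does not state the exact scoring rules, so the function names `levenshtein` and `mean_levenshtein` and the character-level choice are assumptions made for illustration only.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute cost 1 each).

    Assumption: distance is counted over Python code points, so Bengali
    combining marks are treated as separate characters.
    """
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def mean_levenshtein(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Average edit distance over paired reference and predicted transcripts."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sample is required")
    total = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total / len(references)
```

Under this reading, lower is better: a score of 3.819 would mean that a predicted transcript differs from its reference by just under four single-character edits on average.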
