Whisper has recently approached human-level robustness and accuracy in English speech recognition, but minor-language and mixed-language speech recognition still needs substantial improvement. In this work, we present Whisper-MCE, a Whisper model fine-tuned on our self-collected Mixed Cantonese and English (MCE) audio dataset. Whisper-MCE achieves a Mix Error Rate (MER) of 14.28%, which is 35.13% lower than that of the original model. It also achieves a Character Error Rate (CER) of 12.61% on Common Voice zh-HK, a state-of-the-art result. However, MER and CER have limitations for evaluating effectiveness in mixed-language and minor-language contexts. We therefore propose a novel evaluation metric, FAL, which assesses an Automatic Speech Recognition (ASR) system on fidelity to the original audio, accuracy, and latency. Whisper-MCE outperforms other models on this metric, achieving a FAL score of 90.91. The MCE dataset and code can be found at https://github.com/Shelton1013/Whisper MCE.
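
The abstract does not spell out how MER is computed. As a rough illustration only, the sketch below follows a convention often used for code-switched Chinese-English speech: each Chinese character and each English word counts as one token, and the error rate is the token-level edit distance divided by the number of reference tokens. The tokenizer, the function names, and the example sentences are assumptions made for this sketch and may differ from the authors' implementation.

```python
import re

# Assumed tokenization: one token per CJK character (basic block only, so
# rare Cantonese characters in extension blocks are ignored), one token per
# run of Latin letters, digits, or apostrophes. Punctuation is dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text):
    """Split mixed Cantonese-English text into characters and lowercase words."""
    return [token.lower() for token in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(reference, hypothesis):
    """Token-level edit distance divided by the reference token count."""
    ref = tokenize_mixed(reference)
    hyp = tokenize_mixed(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the MCE dataset.
    print(mixed_error_rate("我想book一張ticket", "我想look一張ticket"))
```

Restricting the same computation to Chinese characters alone would give a CER-style score. Neither quantity captures latency, which the abstract names as one of the components of FAL.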