Recent advances in multimodal deep learning have greatly enhanced the capability of systems for speech analysis and pronunciation assessment. Accurate pronunciation detection remains a key challenge in Arabic, particularly in the context of Quranic recitation, where subtle phonetic differences can alter meaning. Addressing this challenge, the present study proposes a transformer-based multimodal framework for Arabic phoneme mispronunciation detection that combines acoustic and textual representations to achieve higher precision and robustness. The framework integrates UniSpeech-derived acoustic embeddings with BERT-based textual embeddings extracted from Whisper transcriptions, creating a unified representation that captures both phonetic detail and linguistic context. To determine the most effective integration strategy, early, intermediate, and late fusion methods were implemented and evaluated on two datasets containing 29 Arabic phonemes, including eight hafiz sounds, articulated by 11 native speakers. Additional speech samples collected from publicly available YouTube recordings were incorporated to enhance data diversity and generalization. Model performance was assessed using standard evaluation metrics: accuracy, precision, recall, and F1-score, allowing a detailed comparison of the fusion strategies. Experimental findings show that the UniSpeech-BERT multimodal configuration provides strong results and that fusion-based transformer architectures are effective for phoneme-level mispronunciation detection. The study contributes to the development of intelligent, speaker-independent, and multimodal Computer-Aided Language Learning (CALL) systems, offering a practical step toward technology-supported Quranic pronunciation training and broader speech-based educational applications.
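To make the described architecture concrete, the sketch below illustrates one way the intermediate-fusion configuration could look: a UniSpeech acoustic branch and a BERT textual branch (fed by a Whisper transcription) whose pooled embeddings are concatenated before a small classifier head. This is an illustrative sketch only, not the paper's implementation; the specific checkpoints (microsoft/unispeech-large-1500h-cv, bert-base-multilingual-cased, openai/whisper-small), the pooling choices, and the classifier head are assumptions.

```python
import torch
import torch.nn as nn
from transformers import (
    AutoFeatureExtractor,   # acoustic preprocessing for UniSpeech
    AutoModel,              # BERT textual encoder
    AutoTokenizer,
    UniSpeechModel,
    pipeline,               # Whisper transcription
)


class FusionMispronunciationClassifier(nn.Module):
    """Intermediate fusion: concatenate pooled acoustic and textual embeddings."""

    def __init__(self, num_classes: int = 29):
        super().__init__()
        # Checkpoint names are assumptions for illustration only.
        self.acoustic = UniSpeechModel.from_pretrained("microsoft/unispeech-large-1500h-cv")
        self.textual = AutoModel.from_pretrained("bert-base-multilingual-cased")
        fused_dim = self.acoustic.config.hidden_size + self.textual.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_values, input_ids, attention_mask):
        # Mean-pool acoustic frames; use the [CLS] token as the textual summary.
        acoustic_emb = self.acoustic(input_values=input_values).last_hidden_state.mean(dim=1)
        textual_emb = self.textual(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        fused = torch.cat([acoustic_emb, textual_emb], dim=-1)  # fusion by concatenation
        return self.head(fused)


# Illustrative end-to-end pass: Whisper transcribes the clip, BERT tokenizes the
# transcript, and the raw waveform feeds the UniSpeech branch.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
extractor = AutoFeatureExtractor.from_pretrained("microsoft/unispeech-large-1500h-cv")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

waveform = torch.randn(16000).numpy()  # placeholder for one second of 16 kHz audio
transcript = asr(waveform)["text"]
audio_inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
text_inputs = tokenizer(transcript, return_tensors="pt")

model = FusionMispronunciationClassifier()
logits = model(audio_inputs.input_values, text_inputs.input_ids, text_inputs.attention_mask)
```

Early and late fusion variants would differ mainly in where the two streams are joined: early fusion merges low-level features before encoding, while late fusion combines the branch-level predictions rather than their embeddings.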