Automatic Speech Recognition (ASR) holds considerable potential to assist in clinical documentation and patient report generation, particularly in resource-constrained regions. Deployment, however, is currently hindered by a technical deadlock: a severe "Reality Gap" between laboratory performance and noisy, real-world clinical audio, compounded by strict privacy and resource constraints. We quantify this gap, showing that a robust multilingual model (IndicWav2Vec) degrades to a word error rate (WER) of 40.94% on rural clinical data from India, rendering it unusable. To address this, we explore a zero-data-exfiltration framework that enables localized, continual adaptation via Low-Rank Adaptation (LoRA). We conduct an investigative study of continual learning strategies, characterizing the trade-offs between data-driven and parameter-driven stability. Our results show that multi-domain Experience Replay (ER) yields the primary performance gains, achieving a 17.1% relative improvement in target WER and reducing catastrophic forgetting by 55% compared to naive adaptation. We further observe that standard Elastic Weight Consolidation (EWC) faces numerical stability challenges when applied to LoRA in noisy environments, and our experiments show that a stabilized, linearized formulation controls gradient magnitudes and enables stable convergence. Finally, a domain-specific spot check verifies that acoustic adaptation is a fundamental prerequisite for usability that language models alone cannot bypass.
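For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how WER and a relative WER improvement are conventionally computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_improvement` are hypothetical, tokenization is assumed to be plain whitespace splitting, and the example sentences and numbers are illustrative only, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace; no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative sentences, not drawn from the clinical data set.
    print(word_error_rate("the patient has fever", "the patient has a fever"))
    # Illustrative WER values, not results reported in the paper.
    print(relative_improvement(0.40, 0.30))
```

A relative improvement is measured against the baseline error rather than in absolute percentage points, so the same absolute drop in WER corresponds to a larger relative gain when the baseline is lower.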