Training state-of-the-art speech-to-text (STT) models on mobile devices is challenging because such devices have limited resources relative to a server environment. In addition, these models are trained on generic datasets that do not exhaustively capture user-specific characteristics. Recently, on-device personalization techniques have made strides in mitigating this problem. Although many works have already explored the effectiveness of on-device personalization, most of their findings are limited to simulation settings or to a single smartphone. In this paper, we develop, and explain in detail, a framework for training end-to-end models on mobile phones. For simplicity, we consider a model based on the connectionist temporal classification (CTC) loss. We evaluate the framework on mobile phones from several brands and report the results. We provide evidence that fine-tuning the models and choosing the right hyperparameter values is a trade-off between the lowest achievable word error rate (WER), on-device training time, and memory consumption; managing this trade-off is therefore vital for successfully deploying on-device training in a resource-limited environment such as a mobile phone. Using training sets from speakers with different accents, we record a 7.6% decrease in average WER. We also report the associated computational cost, measured on the phones in real time as training duration, memory usage, and CPU utilization.
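To make the CTC objective concrete, the following is a minimal pure-Python sketch of the CTC forward algorithm, which computes the loss that fine-tuning would minimize: the negative log-likelihood of a label sequence, summed over all frame-level alignments. This is an illustrative implementation under our own assumptions (the function name `ctc_loss`, the list-of-lists input format, and the tiny two-symbol example are not from the paper).

```python
import math

def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC, given per-frame
    log-probabilities `log_probs` (a T x V list of lists)."""
    # Extend the target with blanks: [blank, t1, blank, t2, ..., blank].
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s]: log-probability of emitting ext[0..s] using frames 0..t.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]
            if s > 0:
                a = logsumexp(a, alpha[s - 1])
            # A skip transition is allowed unless it repeats a label
            # or jumps over a blank required between repeated labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logsumexp(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    # Valid alignments end in the last label or the trailing blank.
    ll = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -ll
```

For example, with two frames, a vocabulary {0: blank, 1: "a"}, and uniform probabilities, the alignments collapsing to "a" are (a, a), (a, blank), and (blank, a), for a total probability of 0.75, so the loss is -ln(0.75):

```python
p = math.log(0.5)
loss = ctc_loss([[p, p], [p, p]], target=[1])  # -> -ln(0.75) ≈ 0.288
```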