Wav2Vec 2.0 is a state-of-the-art model that learns speech representations from unlabeled speech data, i.e., through self-supervised learning. The pretrained model is then fine-tuned on a small amount of labeled data for speech-to-text and machine-translation tasks. Wav2Vec 2.0 is a transformative solution for low-resource languages because it is developed mainly from unlabeled audio data: collecting large amounts of labeled data is resource intensive and especially challenging for low-resource languages such as Swahili and Tatar. Furthermore, Wav2Vec 2.0 matches or surpasses the word error rate (WER) of recent supervised approaches while using 100x less labeled data. Given its importance and enormous potential for enabling speech-based tasks across the world's roughly 7,000 languages, it is essential to evaluate the accuracy, latency, and efficiency of this model on low-resource, low-power edge devices, and to investigate the feasibility of running it on such devices for private, secure, and reliable speech-based tasks. On-device speech processing avoids sending audio data to a server, inherently providing privacy, reduced latency, and enhanced reliability. In this paper, the accuracy and latency of the Wav2Vec 2.0 model, combined with the KenLM language model, are evaluated on a Raspberry Pi for speech recognition tasks. We also discuss how to tune certain parameters to achieve a desired WER and latency while meeting the CPU, memory, and energy budgets of a product.
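The WER metric referred to above is the word-level Levenshtein (edit) distance between a hypothesis transcript and the reference, normalized by the reference length. As a minimal, self-contained sketch of the standard definition (not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a single substituted word in a three-word reference yields a WER of 1/3.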