The storage and computational constraints of embedded devices often demand that a single on-device ASR model serve multiple use cases or domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition that flexibly handles multiple use cases or domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provides a fast response for voice commands and accurate transcription, at the cost of higher latency, for dictation. To achieve flexible and improved accuracy-latency trade-offs, we use the following techniques. First, we propose domain-specific alteration of the segment size of the Emformer encoder, which enables FlexiT to achieve flexible decoding. Second, we use the Alignment Restricted RNNT loss to achieve flexible, fine-grained control over token emission latency for different domains. Finally, we add a domain indicator vector as an additional input to the FlexiT model. Using this combination of techniques, we show that a single model can improve WER and real-time factor for dictation scenarios while maintaining optimal latency for voice command use cases.
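As a rough illustration of two of the techniques named above (a domain-specific segment size and a domain indicator vector), the following minimal sketch may help. It is not the FlexiT implementation: the names `DomainConfig`, `DOMAINS`, and `prepare_segments`, the one-hot encoding of the indicator, its concatenation to every input frame, and the segment sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainConfig:
    """Hypothetical per-domain settings; values are illustrative only."""
    index: int            # position of this domain in the one-hot indicator
    segment_frames: int   # encoder segment size, in frames


# Assumption: a small segment for low-latency commands, a larger one for dictation.
DOMAINS = {
    "voice_command": DomainConfig(index=0, segment_frames=4),
    "dictation": DomainConfig(index=1, segment_frames=16),
}


def domain_one_hot(domain):
    """Build a one-hot indicator vector for the given domain name."""
    vec = [0.0] * len(DOMAINS)
    vec[DOMAINS[domain].index] = 1.0
    return vec


def prepare_segments(frames, domain):
    """Append the domain indicator to every frame, then split into segments.

    Assumption: the indicator is concatenated to each frame's features;
    the abstract only says it is an additional model input.
    """
    cfg = DOMAINS[domain]
    indicator = domain_one_hot(domain)
    tagged = [list(frame) + indicator for frame in frames]
    step = cfg.segment_frames
    return [tagged[i:i + step] for i in range(0, len(tagged), step)]


if __name__ == "__main__":
    frames = [[0.1, 0.2] for _ in range(10)]   # 10 dummy 2-dimensional frames
    for name in DOMAINS:
        segments = prepare_segments(frames, name)
        print(name, [len(segment) for segment in segments])
```

In this sketch the same input is cut into many short segments for voice commands, so the encoder can emit output sooner, and into fewer, longer segments for dictation, giving the encoder more context per step.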