As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate and compact, such systems must decode speech with low user-perceived latency (UPL), producing words as soon as they are spoken. This work examines the impact of various techniques -- model architectures, training criteria, decoding hyperparameters, and endpointer parameters -- on UPL. Our analyses suggest that measures of model size (parameters, input chunk sizes) and of computation (e.g., FLOPS, RTF), which reflect the model's ability to process input frames, are not always strongly correlated with observed UPL. Thus, conventional algorithmic latency measurements may be inadequate for accurately capturing the latency observed when models are deployed on embedded devices. Instead, we find that factors affecting token emission latency and endpointing behavior significantly impact UPL. We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing and using the recently proposed alignment regularization.