Automatic speech recognition (ASR) models are typically designed to operate on a single input data type, e.g. single- or multi-channel audio streamed from a device. This design assumes that the \textit{primary} input data source does not change, so an additional (\textit{auxiliary}) data source cannot be used even when it is occasionally available. An ASR model that operates on both primary and auxiliary data can achieve better accuracy than a primary-only solution, and a model that can serve both \textit{primary-only} (PO) and \textit{primary-plus-auxiliary} (PPA) modes is highly desirable. In this work, we propose a unified ASR model that can serve both modes. We demonstrate its efficacy in a realistic scenario where a set of devices typically streams a single primary audio channel, and streams two additional auxiliary channels \textit{only when} upload bandwidth allows. The architecture enables a unique training methodology that uses both types of input audio at training time. Our proposed approach achieves up to 12.5\% relative word-error-rate reduction (WERR) compared to a PO baseline, and up to 16.0\% relative WERR in low-SNR conditions. The unique training methodology achieves up to 2.5\% relative WERR compared to a PPA baseline.
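
The gains above are reported as relative WERR, but the formula is not spelled out here. As a reading aid, and assuming the conventional definition (an assumption, not stated in the source), with $\mathrm{WER}_{\mathrm{base}}$ and $\mathrm{WER}_{\mathrm{new}}$ as illustrative symbols for the word error rates of the baseline and of the proposed model:

\[
\mathrm{WERR} = \frac{\mathrm{WER}_{\mathrm{base}} - \mathrm{WER}_{\mathrm{new}}}{\mathrm{WER}_{\mathrm{base}}} \times 100\%
\]

For instance, a hypothetical baseline WER of 10.0\% reduced to 8.75\% would correspond to a 12.5\% relative WERR. These WER values are illustrative only and are not reported in this work.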