Automatic speech recognition (ASR) in the cloud allows the use of larger models and more powerful multi-channel signal processing front-ends compared to on-device processing. However, it also adds an inherent latency due to the transmission of the audio signal, especially when transmitting multiple channels of a microphone array. One way to reduce the network bandwidth requirements is client-side compression with a lossy codec such as Opus. However, this compression can have a detrimental effect, especially on multi-channel ASR front-ends, due to the distortion and loss of spatial information introduced by the codec. In this publication, we propose an improved approach for the compression of microphone array signals based on Opus, using a modified joint channel coding approach and additionally introducing a multi-channel spatial decorrelating transform to reduce redundancy in the transmission. We illustrate the effect of the proposed approach on the spatial information retained in multi-channel signals after compression, and evaluate the performance on far-field ASR with a multi-channel beamforming front-end. We demonstrate that our approach can lead to a 37.5% bitrate reduction, or a 5.1% relative word error rate reduction for a fixed bitrate budget, in a seven-channel setup.
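A spatial decorrelating transform of the kind mentioned above can be illustrated with a Karhunen-Loève transform across channels: the eigenvectors of the inter-channel covariance matrix diagonalize it, concentrating energy in a few transformed channels. The following sketch uses synthetic data and is only a minimal illustration of the principle, not the specific transform used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 7-channel microphone array signal: one common source plus
# small per-channel noise, so the channels are strongly correlated.
n_channels, n_samples = 7, 16000
source = rng.standard_normal(n_samples)
x = np.stack([source + 0.1 * rng.standard_normal(n_samples)
              for _ in range(n_channels)])

# Karhunen-Loeve-style spatial decorrelation: project onto the
# eigenvectors of the inter-channel covariance matrix.
cov = np.cov(x)
eigvals, eigvecs = np.linalg.eigh(cov)
y = eigvecs.T @ x  # decorrelated channels, most energy in one component

# Off-diagonal covariance of the transformed channels is (numerically) zero,
# i.e. the inter-channel redundancy has been removed before coding.
cov_y = np.cov(y)
off_diag = cov_y - np.diag(np.diag(cov_y))
print(np.abs(off_diag).max())

# The transform is orthogonal, so the receiver can invert it losslessly.
x_rec = eigvecs @ y
print(np.abs(x_rec - x).max())
```

In a codec pipeline, the transformed channels would then be compressed individually (or with joint coding), with the bitrate allocation skewed toward the high-energy components.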