Transformers are powerful neural architectures that can integrate different modalities using attention mechanisms. In this paper, we apply transformer architectures to multi-channel speech recognition, where the spectral and spatial information collected from different microphones is integrated using attention layers. Our multi-channel transformer network consists of three main parts: channel-wise self-attention (CSA) layers, cross-channel attention (CCA) layers, and multi-channel encoder-decoder attention (EDA) layers. The CSA layers encode the contextual relationships within each channel across time, and the CCA layers encode the relationships between channels. The channel-attended outputs from the CSA and CCA layers are then fed into the EDA layers to help decode the next token given the preceding ones. Experiments show that on a far-field in-house dataset, our method outperforms the baseline single-channel transformer, as well as super-directive and neural beamformers cascaded with transformers.
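To make the data flow through the CSA and CCA layers concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes each channel is a `(time, dim)` feature matrix, omits learned projections, multiple heads, and feed-forward sublayers, and, as a simplifying assumption, lets each channel in the cross-channel step attend to the mean of the other channels. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the time axis.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def channel_wise_self_attention(x):
    # x: (channels, time, dim). Each channel attends only to its own frames.
    return attention(x, x, x)

def cross_channel_attention(h):
    # h: (channels, time, dim). Assumption: channel c uses its own frames as
    # queries and the mean of the remaining channels as keys and values.
    num_channels = h.shape[0]
    others = (h.sum(axis=0, keepdims=True) - h) / (num_channels - 1)
    return attention(h, others, others)

# Toy input: 3 microphones, 5 frames, 8 features per frame.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 8))
csa_out = channel_wise_self_attention(x)
cca_out = cross_channel_attention(csa_out)
```

In this sketch both outputs keep the `(channels, time, dim)` shape of the input, so a decoder-side attention layer could consume them as keys and values for each channel.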