Conformer has proven effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose Branchformer, a more flexible, interpretable, and customizable encoder alternative with parallel branches for modeling dependencies at various ranges in end-to-end speech processing. In each encoder layer, one branch employs self-attention or its variant to capture long-range dependencies, while the other branch utilizes an MLP module with convolutional gating (cgMLP) to extract local relationships. We conduct experiments on several speech recognition and spoken language understanding benchmarks. Results show that our model outperforms both Transformer and cgMLP. It also matches or outperforms state-of-the-art results achieved by Conformer. Furthermore, we show that the two-branch architecture enables various strategies for reducing computation, including variable inference complexity within a single trained model. The weights learned for merging the branches indicate how local and global dependencies are utilized in different layers, which offers insight for model design.
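To make the two-branch design concrete, the following is a minimal PyTorch sketch of one Branchformer-style encoder layer. The hyperparameters (d_model, n_heads, d_ff, kernel_size) and module names are illustrative assumptions, and the merge shown here uses simple softmax-normalized scalar weights; this is a sketch of the general idea, not the paper's exact configuration, which also explores other merging methods.

```python
# Minimal sketch of a Branchformer-style encoder layer (illustrative, not the
# paper's exact configuration). All hyperparameter values are assumptions.
import torch
import torch.nn as nn


class ConvGatingMLP(nn.Module):
    """cgMLP branch: channel projection with convolutional spatial gating."""

    def __init__(self, d_model: int, d_ff: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_up = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        # Convolutional spatial gating: split channels in half, gate one half
        # with a depthwise convolution over time applied to the other half.
        self.gate_norm = nn.LayerNorm(d_ff // 2)
        self.depthwise_conv = nn.Conv1d(
            d_ff // 2, d_ff // 2, kernel_size,
            padding=kernel_size // 2, groups=d_ff // 2)
        self.proj_down = nn.Linear(d_ff // 2, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        x = self.act(self.proj_up(self.norm(x)))
        a, b = x.chunk(2, dim=-1)                # split channels in half
        b = self.gate_norm(b).transpose(1, 2)    # (batch, d_ff // 2, time)
        b = self.depthwise_conv(b).transpose(1, 2)
        return self.proj_down(a * b)             # element-wise gating


class BranchformerLayer(nn.Module):
    """One encoder layer with a global (attention) and a local (cgMLP) branch."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 2048):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cgmlp = ConvGatingMLP(d_model, d_ff)
        # Learned merge weights over the two branches (softmax-normalized);
        # inspecting these indicates how each layer balances local vs. global.
        self.branch_weights = nn.Parameter(torch.zeros(2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.attn_norm(x)
        global_branch, _ = self.attn(y, y, y)    # long-range dependencies
        local_branch = self.cgmlp(x)             # local dependencies
        w = torch.softmax(self.branch_weights, dim=0)
        return x + w[0] * global_branch + w[1] * local_branch
```

Because the branches are computed independently and merged with explicit weights, one branch can be dropped at inference time to trade accuracy for computation, which is the basis of the variable-complexity strategies mentioned above.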