The Transformer model, particularly its cross-attention module, is widely used for feature fusion in target sound extraction, which extracts the signal of interest from a mixture based on given clues. Despite its effectiveness, this approach is computationally inefficient, as the cost of cross-attention scales quadratically with sequence length. Recent advances in state space models, notably Mamba, have shown performance comparable to Transformer-based methods across various tasks while significantly reducing computational complexity. However, Mamba's applicability to target sound extraction is limited because it cannot capture dependencies between different sequences in the way cross-attention does. In this paper, we propose CrossMamba for target sound extraction, which leverages the hidden attention mechanism of Mamba to compute dependencies between the given clues and the audio mixture. The computation of Mamba can be decomposed into a query, key, and value. Following the principle of the cross-attention mechanism in Transformers, we use the clue to generate the query and the audio mixture to derive the key and value. Experimental results on two representative target sound extraction methods validate the efficacy of the proposed CrossMamba.
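For intuition, here is a minimal sketch of this query-key-value decomposition in standard selective state-space notation (the symbols $\bar{A}_t$, $\bar{B}_t$, $C_t$ follow the common Mamba formulation and are our assumption, not necessarily the notation used in the paper). Unrolling the selective recurrence $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$, $y_t = C_t h_t$ with $h_0 = 0$ gives
\[
y_t = \sum_{s=1}^{t} C_t \Big( \prod_{k=s+1}^{t} \bar{A}_k \Big) \bar{B}_s \, x_s,
\]
which mirrors attention (causal, and without a softmax) with $C_t$ playing the role of the query, $\bar{B}_s$ the key, and $x_s$ the value. Under this view, CrossMamba derives $C_t$ from the clue sequence while $\bar{B}_s$ and $x_s$ come from the audio mixture.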