Deep neural networks have shown excellent prospects in speech separation tasks. However, obtaining good results while keeping low model complexity remains challenging in real-world applications. In this paper, we propose an efficient, bio-inspired encoder-decoder architecture, called TDANet, that mimics the brain's top-down attention and reduces model complexity without sacrificing performance. The top-down attention in TDANet is extracted by a global attention (GA) module and cascaded local attention (LA) layers. The GA module takes multi-scale acoustic features as input to extract a global attention signal, which then modulates features of different scales through direct top-down connections. The LA layers use features of adjacent layers as input to extract a local attention signal, which modulates the lateral input in a top-down manner. On three benchmark datasets, TDANet consistently achieved separation performance competitive with previous state-of-the-art (SOTA) methods at higher efficiency. Specifically, TDANet's multiply-accumulate operations (MACs) are only 5\% of those of Sepformer, one of the previous SOTA models, and its CPU inference time is only 10\% of Sepformer's. In addition, a larger version of TDANet achieved SOTA results on all three datasets, with MACs still only 10\% of Sepformer's and CPU inference time only 24\% of Sepformer's. Our study suggests that top-down attention can be a more efficient strategy for speech separation.
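The top-down modulation described above can be illustrated with a minimal conceptual sketch. This is not the authors' implementation; all class and variable names are hypothetical. It assumes a GA-style module that pools multi-scale features to a common resolution, fuses them into one global signal, and gates each scale with that signal through direct top-down connections:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionSketch(nn.Module):
    """Conceptual sketch of top-down global attention over multi-scale features."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution to project the fused multi-scale summary
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, feats):
        # feats: list of tensors [B, C, T_i] at different temporal scales
        target_len = feats[-1].shape[-1]
        # Pool every scale to the coarsest resolution and sum them
        pooled = [F.adaptive_avg_pool1d(f, target_len) for f in feats]
        global_signal = self.proj(torch.stack(pooled).sum(dim=0))
        # Top-down modulation: upsample the global signal to each scale
        # and use it as a multiplicative gate on that scale's features
        out = []
        for f in feats:
            g = F.interpolate(global_signal, size=f.shape[-1], mode="nearest")
            out.append(f * torch.sigmoid(g))
        return out
```

Because the global signal is computed once and applied by cheap elementwise gating, this kind of top-down pathway adds little compute relative to running self-attention at every scale, which is consistent with the efficiency argument in the abstract.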