IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention

Balancing lightweight design against high performance remains a significant challenge for speech enhancement (SE) on resource-constrained devices. Existing state-of-the-art methods such as MUSE have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding adds computational overhead. This paper proposes IMSE, a systematically optimized, ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA rectifies the "amplitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal-strip, and vertical-strip kernels), thereby capturing spectrogram features with very low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving a PESQ score of 3.373, competitive with the state of the art. This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
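
The "amplitude-ignoring" problem mentioned above can be illustrated with a small sketch. It is not the MALA formulation, which the abstract does not spell out; it only shows, under stated assumptions, why a commonly used normalized linear attention loses the query norm. The ReLU feature map, the helper names, and the toy vectors are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relu(v):
    # Assumed feature map phi; it is positively homogeneous: phi(c*x) = c*phi(x) for c > 0.
    return [max(0.0, x) for x in v]

def weighted_average(weights, values):
    # Normalize the weights to sum to 1 and average the value vectors.
    total = sum(weights)  # assumed nonzero for the toy data below
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def softmax_attention(q, keys, values):
    # Standard scaled dot-product attention for a single query vector.
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return weighted_average(weights, values)

def linear_attention(q, keys, values):
    # Normalized linear attention: weights are phi(q) . phi(k), no exponential.
    phi_q = relu(q)
    weights = [dot(phi_q, relu(k)) for k in keys]
    return weighted_average(weights, values)

# Hypothetical toy data: one query, three key/value pairs.
q = [1.0, 0.5]
q_scaled = [4.0 * x for x in q]  # same direction, four times the norm
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

soft_a = softmax_attention(q, keys, values)
soft_b = softmax_attention(q_scaled, keys, values)
lin_a = linear_attention(q, keys, values)
lin_b = linear_attention(q_scaled, keys, values)

# Largest per-dimension change in the output when only the query norm changes.
soft_diff = max(abs(a - b) for a, b in zip(soft_a, soft_b))
lin_diff = max(abs(a - b) for a, b in zip(lin_a, lin_b))
print("softmax attention change:", soft_diff)
print("linear attention change: ", lin_diff)
```

In this sketch the softmax output shifts when the query is scaled, because a larger norm sharpens the attention distribution, whereas the linear-attention output stays the same up to floating-point error: the scale factor appears in both the numerator and the normalizer and cancels. That cancellation is the kind of lost norm information the abstract says MALA preserves explicitly.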
