This study presents UX-Net, a time-domain audio separation network (TasNet) based on a modified U-Net architecture. The proposed UX-Net runs in real time and handles either single- or multi-microphone input. Inspired by the filter-and-process-based human auditory behavior, the proposed system introduces novel mixer and separation modules, which yield cost- and memory-efficient modeling of speech sources. The mixer module combines the encoded input in a latent feature space and produces a desired number of output streams. Then, in the separation module, a modified U-Net (UX) block is applied. The UX block first filters the encoded input at various resolutions, then aggregates the filtered information and applies recurrent processing to estimate masks for the separated sources. The letter 'X' in UX-Net is a placeholder for the type of recurrent layer employed in the UX block. Empirical findings on the WSJ0-2mix benchmark dataset show that one of the UX-Net configurations outperforms the state-of-the-art Conv-TasNet system by 0.85 dB SI-SNR while using only 16% of the model parameters, requiring 58% fewer computations, and maintaining low latency.
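The reported gain is measured in SI-SNR (scale-invariant signal-to-noise ratio). The abstract does not spell out the metric, so the following is only a minimal sketch of the commonly used definition: both signals are made zero-mean, the estimate is projected onto the reference, and the energy of the projection is compared with the energy of the residual. The function name `si_snr` and the `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; `eps` guards against division
    by zero and log of zero and is an assumption of this sketch.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")

    # Remove the mean so the metric ignores any DC offset.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target: s_target = (<est, tgt> / ||tgt||^2) * tgt
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]

    # Whatever the projection does not explain is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(n * n for n in e_noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged. For example, with `target = [1.0, -1.0, 1.0, -1.0]` and `estimate = [1.1, -0.9, 0.9, -1.1]`, the residual carries 1/100 of the projected energy, so the score is approximately 20 dB.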