Recent research on time-domain audio separation networks (TasNets) has brought great success to speech separation. Nevertheless, conventional TasNets struggle to satisfy the memory and latency constraints of industrial applications. In this regard, we design a low-cost, high-performance architecture, namely, the globally attentive locally recurrent (GALR) network. Like the dual-path RNN (DPRNN), we first split a feature sequence into 2D segments and then process the sequence along both the intra- and inter-segment dimensions. Our main innovation lies in that, on top of features recurrently processed along the intra-segment dimension, GALR applies a self-attention mechanism to the sequence along the inter-segment dimension, which aggregates context-aware information and also enables parallelization. Our experiments suggest that GALR is a notably more effective network than prior work. On one hand, with only 1.5M parameters, it achieves comparable separation performance at much lower cost, with 36.1% less runtime memory and 49.4% fewer computational operations relative to DPRNN. On the other hand, at a model size comparable to DPRNN's, GALR consistently outperforms DPRNN on three datasets, in particular with a substantial margin of 2.4 dB absolute SI-SNRi improvement on the benchmark WSJ0-2mix task.
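To make the segment-and-two-pass idea concrete, below is a minimal sketch of one GALR-style block, written from the description above alone: a bidirectional RNN runs along the intra-segment (local) dimension and multi-head self-attention runs along the inter-segment (global) dimension. This assumes PyTorch; the class name `GALRBlock` and all hyperparameters are illustrative choices, not the authors' released implementation.

```python
# Sketch of a GALR-style block: locally recurrent (BiLSTM within segments),
# globally attentive (self-attention across segments). Illustrative only.
import torch
import torch.nn as nn


class GALRBlock(nn.Module):
    def __init__(self, dim: int, hidden: int = 128, heads: int = 8):
        super().__init__()
        # Locally recurrent path: BiLSTM over frames within each segment.
        self.local_rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.local_proj = nn.Linear(2 * hidden, dim)
        self.local_norm = nn.LayerNorm(dim)
        # Globally attentive path: self-attention across segments at each
        # intra-segment position; parallelizable, unlike a second RNN pass.
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, num_segments S, segment_length K, feature_dim D)
        B, S, K, D = x.shape

        # --- intra-segment recurrence: treat each segment as a short sequence ---
        local = x.reshape(B * S, K, D)
        local, _ = self.local_rnn(local)                 # (B*S, K, 2*hidden)
        local = self.local_proj(local).reshape(B, S, K, D)
        x = self.local_norm(x + local)                   # residual connection

        # --- inter-segment self-attention: attend across the S segments ---
        glob = x.transpose(1, 2).reshape(B * K, S, D)
        attn_out, _ = self.global_attn(glob, glob, glob)
        attn_out = attn_out.reshape(B, K, S, D).transpose(1, 2)
        return self.global_norm(x + attn_out)


if __name__ == "__main__":
    # Toy input: 2 utterances, 10 segments of 64 frames, 64-dim features.
    feats = torch.randn(2, 10, 64, 64)
    block = GALRBlock(dim=64)
    print(block(feats).shape)  # torch.Size([2, 10, 64, 64])
```

Note the design point this sketch illustrates: because the global pass is attention rather than recurrence, all K intra-segment positions can attend across segments in parallel, whereas a second RNN pass (as in DPRNN) must step through the segments sequentially.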