Audio-visual approaches that incorporate visual inputs have laid the foundation for recent progress in speech separation. However, how to best exploit auditory and visual inputs concurrently remains an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits the fused information back to the auditory and visual subnetworks, and this process is repeated several times. Experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks. The project repository is available at https://github.com/JusperLee/CTCNet.
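To make the repeated cortico-thalamo-cortical cycle concrete, the following is a minimal PyTorch sketch of the fusion loop the abstract describes: bottom-up unimodal passes, top-down fusion in a shared "thalamic" module, and feedback of the fused representation into each stream. All module choices, dimensions, and names (`CTCFusionSketch`, `audio_dim`, `video_dim`, `fused_dim`, `n_cycles`) are illustrative assumptions, not the paper's actual architecture or configuration; see the project repository for the real implementation.

```python
# Hedged sketch of a cortico-thalamo-cortical fusion cycle.
# All names and shapes are illustrative assumptions, not CTCNet's real design.
import torch
import torch.nn as nn


class CTCFusionSketch(nn.Module):
    def __init__(self, audio_dim=128, video_dim=64, fused_dim=128, n_cycles=3):
        super().__init__()
        self.n_cycles = n_cycles
        # "Cortical" subnetworks: unimodal bottom-up encoders standing in for
        # the hierarchical auditory and visual subnetworks described above.
        self.audio_net = nn.GRU(audio_dim, audio_dim, batch_first=True)
        self.video_net = nn.GRU(video_dim, video_dim, batch_first=True)
        # "Thalamic" subnetwork: fuses top-down projections of both modalities.
        self.fuse = nn.Linear(audio_dim + video_dim, fused_dim)
        # Feedback projections carrying fused information back to each stream.
        self.to_audio = nn.Linear(fused_dim, audio_dim)
        self.to_video = nn.Linear(fused_dim, video_dim)

    def forward(self, audio, video):
        # audio: (batch, time, audio_dim); video: (batch, time, video_dim),
        # assumed pre-aligned to a common frame rate for simplicity.
        for _ in range(self.n_cycles):
            audio, _ = self.audio_net(audio)  # bottom-up auditory pass
            video, _ = self.video_net(video)  # bottom-up visual pass
            # Top-down fusion in the shared "thalamic" module.
            fused = self.fuse(torch.cat([audio, video], dim=-1))
            # Feedback: modulate each stream with the fused representation,
            # then repeat the whole cycle.
            audio = audio + self.to_audio(fused)
            video = video + self.to_video(fused)
        return audio, video


if __name__ == "__main__":
    model = CTCFusionSketch()
    a = torch.randn(2, 50, 128)  # dummy auditory features
    v = torch.randn(2, 50, 64)   # dummy visual features
    out_a, out_v = model(a, v)
    print(out_a.shape, out_v.shape)
```

In this reading, the repeated cycle lets each unimodal stream be refined several times under multimodal context before separation, which is the mechanism the abstract credits for strong performance with fewer parameters.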