We study the cocktail party problem and propose a novel attention network called Tune-In, short for training under negative environments with interference. It first learns two separate spaces, one of speaker knowledge and one of speech stimuli, on top of a shared feature space, with a newly designed block structure serving as the building block for all spaces, and then solves different tasks cooperatively. Information is passed between the two spaces via a novel cross- and dual-attention mechanism that mimics the bottom-up and top-down processes of the human cocktail party effect. We find that substantially discriminative and generalizable speaker representations can be learnt under severe interference via our self-supervised training, and the experimental results verify this seeming paradox. The learnt speaker embedding has greater discriminative power than that of a standard speaker verification method. Meanwhile, Tune-In achieves remarkably better speech separation performance than state-of-the-art benchmark systems in terms of SI-SNRi and SDRi, consistently across all test modes, and does so with lower memory and computational consumption.
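For readers unfamiliar with the SI-SNRi metric named above, the following is a minimal sketch of the commonly used definition of scale-invariant signal-to-noise ratio and its improvement over the unprocessed mixture. It is not taken from the Tune-In implementation; the function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the plain-list signal representation are all assumptions made for illustration.

```python
import math


def _zero_mean(x):
    """Return a copy of x with its mean removed."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB (hypothetical helper, common definition).

    The estimate is projected onto the reference; whatever remains after
    the projection is treated as noise.
    """
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]              # component along the reference
    noise = [e - t for e, t in zip(est, target)]   # residual component
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why this metric is often preferred over plain SNR when comparing separation systems.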