Personalized speech enhancement (PSE), the task of estimating a clean target speech signal in real time by leveraging a speaker embedding vector of the target talker, has garnered much attention from the research community due to the recent surge of online meetings across the globe. For practical full-duplex communication, PSE models also require acoustic echo cancellation (AEC) capability. In this work, we take a recently proposed causal end-to-end enhancement network (E3Net) and modify it to obtain a joint PSE-AEC model. We dedicate the early layers to the AEC task and encourage the later layers to handle personalization by adding a bypass connection from the early layers to the mask prediction layer. This allows us to employ a multi-task learning framework for joint PSE and AEC training. We evaluate the model extensively on test scenarios with both simulated and real-world recordings. The results show that our joint model comes close to the expert models on each individual task and performs significantly better in the combined PSE-AEC scenario.
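The data flow described above (early layers for AEC, later layers for personalization, and a bypass from the early layers to the mask prediction layer) can be illustrated with a toy sketch. This is not the E3Net architecture or the authors' implementation: the layer sizes, the single dense layer per stage, the class name `JointPseAecSketch`, and the weighted-sum form of `multitask_loss` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def dense(x, w, b):
    """Affine layer on a feature vector (list of floats)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases for a dense layer."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


class JointPseAecSketch:
    """Toy per-frame model; all sizes are hypothetical."""

    def __init__(self, n_feat=8, n_emb=4, n_hid=6, seed=0):
        rng = random.Random(seed)
        # Early stage: microphone + far-end reference features (AEC role).
        self.early = init_layer(n_hid, 2 * n_feat, rng)
        # Late stage: early output + speaker embedding (personalization role).
        self.late = init_layer(n_hid, n_hid + n_emb, rng)
        # Mask head: late output concatenated with the bypassed early output.
        self.mask_head = init_layer(n_feat, 2 * n_hid, rng)

    def forward(self, mic, far_end, spk_emb):
        h_early = relu(dense(mic + far_end, *self.early))
        h_late = relu(dense(h_early + spk_emb, *self.late))
        # Bypass connection: early features reach the mask layer directly.
        mask = sigmoid(dense(h_late + h_early, *self.mask_head))
        # Apply the mask to the microphone features.
        return [m * x for m, x in zip(mask, mic)]


def multitask_loss(loss_pse, loss_aec, aec_weight=0.5):
    """Assumed weighted sum of per-task losses (weighting is hypothetical)."""
    return aec_weight * loss_aec + (1.0 - aec_weight) * loss_pse


model = JointPseAecSketch()
enhanced = model.forward([0.1] * 8, [0.2] * 8, [0.3] * 4)
```

Because the mask head receives the early features directly, a training signal for echo removal can reach the early stage without passing through the speaker-conditioned stage, which is one plausible reading of why the bypass makes multi-task training convenient.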