Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. Prior studies have used a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally synchronized with speech also contribute to speech perception. In this work, we explore the use of the co-speech gesture sequence, e.g., hand and body movements, as the speaker cue for speaker extraction. Such gestures can be captured from low-resolution video recordings and are therefore more readily available than face recordings. We propose two networks that use the co-speech gesture cue to perform attentive listening to the target speaker: one implicitly fuses the gesture cue into the speaker extraction process, while the other performs speech separation first and then explicitly uses the gesture cue to associate a separated speech stream with the target speaker. The experimental results show that the co-speech gesture cue is informative in associating speech with the target speaker.
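The second variant's association step can be illustrated with a minimal sketch: given an embedding of the target speaker's gesture sequence and embeddings of each separated speech stream, select the stream that best matches the gesture cue. All function names and the use of cosine similarity here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def select_target_stream(gesture_emb, speech_embs):
    """Hypothetical association step: pick the separated speech stream
    whose embedding is most similar to the co-speech gesture embedding.

    gesture_emb: 1-D array, embedding of the target speaker's gestures
    speech_embs: list of 1-D arrays, one embedding per separated stream
    Returns the index of the best-matching stream and all scores.
    """
    def cosine(a, b):
        # Cosine similarity with a small epsilon to avoid division by zero
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scores = [cosine(gesture_emb, s) for s in speech_embs]
    return int(np.argmax(scores)), scores

# Toy usage: stream 1 is closer to the gesture embedding than stream 0
gesture = np.array([1.0, 0.0])
streams = [np.array([0.0, 1.0]), np.array([0.9, 0.1])]
idx, scores = select_target_stream(gesture, streams)
print(idx)  # index of the stream associated with the target speaker
```

In practice the two embeddings would come from learned gesture and speech encoders trained so that matching pairs score highly; the cosine ranking above only sketches the selection logic.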