Audio-visual target speech extraction, which aims to extract a target speaker's speech from a noisy mixture by looking at the speaker's lip movements, has made significant progress by combining time-domain speech separation models with CNN-based visual feature extractors. One problem in fusing audio and video information is that the two streams have different time resolutions. Most current research upsamples the visual features along the time dimension so that the audio and video features align in time. However, we believe that lip movements mostly carry long-term, phone-level information. Based on this assumption, we propose a new way to fuse audio-visual features. We observe that in DPRNN \cite{dprnn}, the time resolution of the inter-chunk dimension can be made very close to that of the video frames. As in \cite{sepformer}, we replace the LSTMs in DPRNN with intra-chunk and inter-chunk self-attention, but in the proposed algorithm the inter-chunk attention additionally incorporates the visual features as an extra feature stream. This avoids upsampling the visual cues and results in more efficient audio-visual fusion. Experimental results show that the proposed method outperforms other time-domain audio-visual fusion models.
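To make the fusion idea concrete, below is a minimal PyTorch sketch of one plausible reading of the inter-chunk fusion: the video frames join the audio chunks as additional tokens in a joint self-attention over the inter-chunk axis, so the video stream needs no temporal upsampling. The module name, dimensions, and hyper-parameters here are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class AVInterChunkAttention(nn.Module):
    """Sketch: inter-chunk self-attention where visual features are
    concatenated with the audio chunk sequence as an extra token stream
    (one possible instantiation of the fusion described in the abstract)."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_chunks, video_feats):
        # audio_chunks: [B, S, D]  -- S chunks along the inter-chunk axis
        # video_feats:  [B, Sv, D] -- lip features at the video frame rate;
        #                             Sv is close to S, so no upsampling
        tokens = torch.cat([video_feats, audio_chunks], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)  # joint self-attention
        out = self.norm(tokens + out)               # residual + layer norm
        return out[:, video_feats.size(1):, :]      # keep audio positions only

# Toy usage: 100 audio chunks vs. 98 video frames, 256-dim features.
fusion = AVInterChunkAttention()
audio = torch.randn(2, 100, 256)
video = torch.randn(2, 98, 256)
print(fusion(audio, video).shape)  # torch.Size([2, 100, 256])
```

Because the chunk hop of the dual-path segmentation can be chosen so that the number of chunks roughly matches the number of video frames, the two sequences attend to each other directly at comparable time scales.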