Lack of audio-video synchronization is a common problem in television broadcasts and video conferencing, leading to an unsatisfactory viewing experience. A widely accepted paradigm is to build an error-detection mechanism that identifies cases where the audio is leading or lagging. We propose ModEFormer, which independently extracts audio and video embeddings using modality-specific transformers. Unlike other transformer-based approaches, ModEFormer preserves the modality of the input streams, which allows us to use a larger batch size with more negative audio samples for contrastive learning. Further, we propose a trade-off between the number of negative samples and the number of unique samples in a batch, which lets us significantly exceed the performance of previous methods. Experimental results show that ModEFormer achieves state-of-the-art performance: 94.5% on LRS2 and 90.9% on LRS3. Finally, we demonstrate how ModEFormer can be used for offset detection on test clips.
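To make the contrastive setup concrete, below is a minimal sketch, not the authors' code, of an InfoNCE-style loss where each in-sync audio-video pair in a batch is a positive and every other audio sample serves as a negative, so a larger batch directly yields more negatives. The function names, temperature value, and the offset-detection search window are illustrative assumptions; the paper's actual procedure may differ.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(video_emb: torch.Tensor, audio_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """video_emb, audio_emb: (B, D) clip-level embeddings from the two
    modality-specific encoders. Diagonal pairs are in-sync positives;
    off-diagonal audio samples act as negatives (assumed setup)."""
    video_emb = F.normalize(video_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    # (B, B) cosine-similarity matrix scaled by temperature.
    logits = video_emb @ audio_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


def detect_offset(video_embs: torch.Tensor, audio_embs: torch.Tensor,
                  max_offset: int = 15) -> int:
    """video_embs, audio_embs: (T, D) per-frame embeddings of a test clip.
    Returns the audio shift (in frames) that maximizes mean cosine
    similarity; the symmetric search window is an illustrative choice."""
    video_embs = F.normalize(video_embs, dim=-1)
    audio_embs = F.normalize(audio_embs, dim=-1)
    best_offset, best_score = 0, float("-inf")
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:
            v, a = video_embs[off:], audio_embs[:len(audio_embs) - off]
        else:
            v, a = video_embs[:off], audio_embs[-off:]
        score = (v * a).sum(dim=-1).mean().item()
        if score > best_score:
            best_offset, best_score = off, score
    return best_offset
```

The offset detector simply scores every candidate alignment of the two embedding streams and picks the argmax, which mirrors how a sync model's similarity scores can be repurposed for offset estimation.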