In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
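To make the two-stream design concrete, below is a minimal PyTorch sketch of a joint Transformer over visual and phonetic tokens with a per-frame localization head. The module names, dimensions, and output head are illustrative assumptions, not the paper's actual Transpotter implementation.

```python
# Minimal sketch of a two-stream cross-modal attention spotter (illustrative only).
import torch
import torch.nn as nn

class CrossModalSpotter(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        # Joint Transformer encoder over the concatenated visual + phoneme tokens,
        # so self-attention provides full cross-modal attention between the streams.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Per-frame head: score for whether the keyword is being mouthed at each frame.
        self.frame_head = nn.Linear(d_model, 1)

    def forward(self, visual_feats, phoneme_feats):
        # visual_feats:  (B, T_video, d_model)   per-frame visual encodings
        # phoneme_feats: (B, T_keyword, d_model) phonetic encoding of the keyword
        tokens = torch.cat([visual_feats, phoneme_feats], dim=1)
        fused = self.encoder(tokens)
        video_part = fused[:, : visual_feats.size(1)]    # keep only the video positions
        return self.frame_head(video_part).squeeze(-1)   # (B, T_video) localization logits

# Usage with random features standing in for real visual/phonetic encoders.
model = CrossModalSpotter()
logits = model(torch.randn(2, 50, 256), torch.randn(2, 8, 256))
print(logits.shape)  # torch.Size([2, 50])
```

In this sketch, thresholding the per-frame logits would give the temporal location of the keyword in the clip; the paper's actual training objective and decoding are not reproduced here.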