The focus of this work is $\textit{sign spotting}$: given a video of an isolated sign, our task is to identify $\textit{whether}$ and $\textit{where}$ it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) $\textit{watching}$ existing footage which is sparsely labelled using mouthing cues; (2) $\textit{reading}$ associated subtitles (readily available translations of the signed content) which provide additional $\textit{weak supervision}$; (3) $\textit{looking up}$ words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on low-shot sign spotting benchmarks. In addition, we contribute a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to facilitate study of this task. The dataset, models and code are available at our project page.
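To make the unified objective concrete, the combination of Multiple Instance Learning with Noise Contrastive Estimation can be read as an MIL-NCE-style loss: the embedding of a window from the continuous video should match at least one of a bag of candidate positives (e.g. dictionary variants of the same sign) when contrasted against negatives. The sketch below illustrates this form of loss; the function name, tensor shapes and temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(query, positives, negatives, temperature=0.07):
    """MIL-NCE-style objective: the query should match at least one of the
    candidate positives, contrasted against the negatives.

    query:     (d,)   L2-normalised embedding of a continuous-video window
    positives: (P, d) bag of candidate matches (e.g. dictionary variants)
    negatives: (N, d) embeddings of non-matching signs
    """
    # Similarities are summed over the positive bag (the MIL part), then
    # normalised against positives plus negatives (the NCE part).
    pos = torch.exp(query @ positives.T / temperature).sum()
    neg = torch.exp(query @ negatives.T / temperature).sum()
    return -torch.log(pos / (pos + neg))

# Hypothetical usage with random, L2-normalised embeddings:
d = 256
q = F.normalize(torch.randn(d), dim=0)
pos = F.normalize(torch.randn(4, d), dim=1)    # 4 dictionary variants of a sign
neg = F.normalize(torch.randn(100, d), dim=1)  # 100 negative signs
loss = mil_nce_loss(q, pos, neg)
```

Summing over the positive bag before normalising means the loss is satisfied as long as any one dictionary variant matches the window, which is the Multiple Instance Learning behaviour the abstract describes.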