Most of the vision-based sign language research to date has focused on Isolated Sign Language Recognition (ISLR), where the objective is to predict a single sign class given a short video clip. Although there has been significant progress in ISLR, its real-life applications are limited. In this paper, we focus on the more challenging task of Sign Spotting instead, where the goal is to simultaneously identify and localise signs in continuous, co-articulated sign videos. To address the limitations of current ISLR-based models, we propose a hierarchical sign spotting approach that learns coarse-to-fine spatio-temporal sign features, taking advantage of representations at multiple temporal resolutions to provide more precise sign localisation. Specifically, we develop the Hierarchical Sign I3D model (HS-I3D), which attaches a hierarchical network head to the existing spatio-temporal I3D model in order to exploit features at different layers of the network. We evaluate HS-I3D on the ChaLearn 2022 Sign Spotting Challenge - MSSL track and achieve a state-of-the-art F1 score of 0.607, making ours the top-1 winning solution of the competition.
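The abstract does not give the exact architecture of the hierarchical head, but the core idea of fusing features from layers at different temporal resolutions can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the authors' implementation: the strides, channel counts, and nearest-neighbour upsampling are all assumptions made for the example.

```python
import numpy as np

def upsample_time(x, factor):
    # Nearest-neighbour temporal upsampling: (T, C) -> (T * factor, C).
    return np.repeat(x, factor, axis=0)

def coarse_to_fine_fusion(feats):
    """Fuse per-frame features taken from several network depths.

    `feats` maps a temporal stride to a (T // stride, C) feature array,
    mimicking intermediate outputs of a spatio-temporal backbone such as
    I3D. Coarser (larger-stride) levels are upsampled to the finest
    temporal resolution and concatenated channel-wise, so every output
    time step carries both fine local detail and coarse temporal context
    -- the kind of representation useful for localising sign boundaries.
    """
    finest = min(feats)  # smallest stride = highest temporal resolution
    fused = [upsample_time(x, stride // finest)
             for stride, x in sorted(feats.items())]
    return np.concatenate(fused, axis=1)

# Toy example: a 64-frame clip with 16-channel features at strides 2, 4, 8.
T, C = 64, 16
rng = np.random.default_rng(0)
feats = {s: rng.standard_normal((T // s, C)) for s in (2, 4, 8)}
fused = coarse_to_fine_fusion(feats)
print(fused.shape)  # (32, 48): T/2 time steps, 3 levels x 16 channels
```

A per-frame classifier applied on top of such fused features could then emit sign/background predictions at every time step, which is one plausible route to the simultaneous identification and localisation the paper targets.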