The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses two key issues in sign language recognition: the need for complex multi-scale features, such as the hands, face, and mouth, and the absence of frame-level annotations. To this end, we propose (1) Divide and Focus Convolution (DFConv), which extracts both manual and non-manual features without requiring additional networks or annotations, and (2) Dense Pseudo-Label Refinement (DPLR), which propagates non-spiky frame-level pseudo-labels by combining the ground-truth gloss sequence labels with the predicted sequence. We demonstrate that our model achieves state-of-the-art performance among RGB-based methods on two large-scale CSLR benchmarks, PHOENIX-2014 and PHOENIX-2014-T, while showing comparable results with better efficiency than approaches that rely on multi-modality or extra annotations.
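To make the DFConv idea concrete, here is a minimal sketch, assuming (this detail is not stated in the abstract) that the feature map is split along the height axis into an upper region carrying non-manual cues (face, mouth) and a lower region carrying manual cues (hands), each processed by its own convolution before being re-merged. The class name, split ratio, and layer choices are all illustrative, not the paper's implementation.

```python
# Hypothetical sketch of a divide-and-focus style convolution.
# Assumption: upper region = face/mouth (non-manual), lower = hands (manual).
import torch
import torch.nn as nn

class DivideAndFocusConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, split_ratio: float = 0.5):
        super().__init__()
        self.split_ratio = split_ratio
        # Separate branches let each region specialize without
        # extra networks or region-level annotations.
        self.upper_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.lower_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        split = int(x.size(2) * self.split_ratio)
        upper = self.upper_conv(x[:, :, :split, :])  # face / mouth region
        lower = self.lower_conv(x[:, :, split:, :])  # hand region
        return torch.cat([upper, lower], dim=2)      # re-merge along height

# Usage: a batch of two 224x224 RGB frames.
feat = DivideAndFocusConv(3, 64)(torch.randn(2, 3, 224, 224))
print(feat.shape)  # torch.Size([2, 64, 224, 224])
```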
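The DPLR step can likewise be illustrated with a toy sketch. The assumption here (the abstract does not specify the mechanism) is that spiky per-frame predictions are first filtered against the ground-truth gloss order, and the surviving anchor frames are then propagated to their neighbours so every frame receives a dense, non-blank pseudo-label. The function name, blank index, and fill strategy are hypothetical.

```python
# Hypothetical sketch of dense pseudo-label refinement:
# keep predictions consistent with the GT gloss order, then densify.
from typing import List

BLANK = -1  # assumed CTC blank index

def refine_pseudo_labels(frame_preds: List[int], gt_glosses: List[int]) -> List[int]:
    # Keep only predictions that follow the ground-truth gloss order.
    anchors = [BLANK] * len(frame_preds)
    gi = 0
    for t, p in enumerate(frame_preds):
        if gi < len(gt_glosses) and p == gt_glosses[gi]:
            anchors[t] = p
        elif gi + 1 < len(gt_glosses) and p == gt_glosses[gi + 1]:
            gi += 1  # advance monotonically through the GT sequence
            anchors[t] = p
    # Densify: forward-fill then backward-fill the anchors so the
    # pseudo-label sequence is non-spiky (every frame is labelled).
    dense = anchors[:]
    for t in range(1, len(dense)):
        if dense[t] == BLANK:
            dense[t] = dense[t - 1]
    for t in range(len(dense) - 2, -1, -1):
        if dense[t] == BLANK:
            dense[t] = dense[t + 1]
    return dense

# Usage: GT glosses [5, 9], spiky frame predictions with blanks.
print(refine_pseudo_labels([-1, 5, -1, -1, 9, -1], [5, 9]))
# [5, 5, 5, 5, 9, 9]
```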