Fingerspelling in sign language has been the means of communicating technical terms and proper nouns that lack dedicated sign language gestures. Automatic recognition of fingerspelling can help resolve communication barriers when interacting with deaf people. The main challenges in fingerspelling recognition are ambiguity in the gestures and strong articulation of the hands; an automatic recognition model must therefore address high inter-class visual similarity and high intra-class variation in the gestures. Most existing research on fingerspelling recognition has focused on datasets collected in controlled environments. The recent collection of a large-scale annotated fingerspelling dataset in the wild, drawn from social media and online platforms, captures the challenges of real-world scenarios. In this work, we propose a fine-grained visual attention mechanism using the Transformer model for the sequence-to-sequence prediction task on the wild dataset. Fine-grained attention is achieved by utilizing the change in motion across video frames (optical flow) in sequential context-based attention, together with a Transformer encoder model. The unsegmented continuous video dataset is trained jointly by balancing the Connectionist Temporal Classification (CTC) loss and the maximum-entropy loss. The proposed approach captures fine-grained attention in a single iteration. Experimental evaluations show that it outperforms state-of-the-art approaches.
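To make the two central ideas concrete, the sketch below illustrates (a) attention over frame features guided by an optical-flow motion prior and (b) balancing a CTC-style loss against a maximum-entropy regularizer. This is a minimal illustrative simplification, not the paper's architecture: the function names, the use of flow magnitude directly as the attention score, and the scalar `alpha` weighting are all assumptions made for clarity; in the actual model the attention scores come from a learned Transformer encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def flow_guided_attention(frame_feats, flow_mag, temperature=1.0):
    """Pool per-frame features with weights derived from an
    optical-flow motion prior (hypothetical simplification: in the
    paper, scores come from a Transformer encoder, with flow acting
    as sequential context).

    frame_feats: (T, D) array of frame features
    flow_mag:    (T,) array of optical-flow magnitudes
    """
    weights = softmax(flow_mag / temperature)            # (T,)
    attended = (weights[:, None] * frame_feats).sum(0)   # (D,)
    return attended, weights

def joint_loss(ctc_loss, entropy, alpha=0.8):
    """Balance the CTC loss against a maximum-entropy term
    (subtracting entropy encourages high-entropy, less over-confident
    predictions). The weighting scheme here is an assumption."""
    return alpha * ctc_loss - (1.0 - alpha) * entropy
```

For example, a frame with a large flow magnitude (i.e., strong hand motion) receives the largest attention weight, so the pooled feature is dominated by the frames where articulation actually happens.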