Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary: gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a few-shot Prototypical Network framework adapted for a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach uses episodic training to learn a semantic metric space in which signs are classified by their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset support this metric-learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline with an identical backbone architecture by over 13%, indicating that the prototypical training strategy remains effective in data-scarce settings where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.
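The prototype-based classification step described in the abstract can be illustrated with a minimal sketch. It assumes squared Euclidean distance and a softmax over negative distances, as in the original Prototypical Networks formulation; the abstract does not state which metric the authors use. The two-dimensional vectors stand in for encoder outputs, and the function names and the labels `hello` and `thanks` are hypothetical, chosen only for illustration.

```python
import math


def class_prototypes(support):
    """Average the support embeddings of each class into one prototype.

    support: dict mapping a class label to a list of embedding vectors.
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return protos


def squared_distance(a, b):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, protos):
    """Return the nearest-prototype label and the class probabilities."""
    # Closer prototypes get larger logits.
    logits = {label: -squared_distance(query, p) for label, p in protos.items()}
    # Numerically stable softmax over the logits.
    top = max(logits.values())
    exps = {label: math.exp(z - top) for label, z in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs


# Toy episode: two classes with two support embeddings each (placeholder data).
support = {
    "hello": [[0.0, 0.0], [0.0, 2.0]],
    "thanks": [[10.0, 10.0], [10.0, 12.0]],
}
protos = class_prototypes(support)
label, probs = classify([1.0, 1.0], protos)
print(label, probs)
```

Because the prototypes are recomputed from whatever support examples are supplied, a new sign can in principle be recognized by providing a few embeddings for it, without retraining a fixed output layer.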