Today's sign language recognition models require large training corpora of laboratory-like videos, whose collection demands extensive labor and financial resources. As a result, only a handful of such systems are publicly available, and their localization to sign languages with smaller signing populations remains limited. Utilizing online text-to-video dictionaries, which inherently hold annotated data of various attributes and sign languages, and training models in a few-shot fashion hence poses a promising path toward the democratization of this technology. In this work, we collect and open-source the UWB-SL-Wild few-shot dataset, the first training resource of its kind consisting of dictionary-scraped videos. This dataset represents the actual distribution and characteristics of sign language data available online. We select glosses that directly overlap with the existing WLASL100 and ASLLVD datasets and share their class mappings to enable transfer learning experiments. Apart from providing baseline results with a pose-based architecture, we introduce a novel approach to training sign language recognition models in a few-shot scenario, achieving state-of-the-art results on the ASLLVD-Skeleton and ASLLVD-Skeleton-20 datasets with top-1 accuracies of $30.97~\%$ and $95.45~\%$, respectively.