Sign language recognition using computational models is a challenging problem that requires simultaneous spatio-temporal modeling of multiple sources, i.e., the face, hands, and body. In this paper, we propose an isolated sign language recognition model based on Motion History Images (MHI) generated from RGB video frames. These RGB-MHI images effectively summarize the spatio-temporal content of each sign video in a single RGB image. We propose two different approaches that use this RGB-MHI model. In the first approach, we use the RGB-MHI model as a motion-based spatial attention module integrated into a 3D-CNN architecture. In the second approach, we combine RGB-MHI model features with the features of a 3D-CNN model using a late fusion technique. We perform extensive experiments on two recently released large-scale isolated sign language datasets, namely AUTSL and BosphorusSign22k. Our experiments show that our models, which use only RGB data, can compete with state-of-the-art models in the literature that use multi-modal data.
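To make the MHI idea concrete, the following is a minimal sketch of the classic single-channel Motion History Image (Bobick and Davis style): each pixel is set to a maximum value when frame-to-frame motion is detected there and decays linearly otherwise, so recent motion appears bright and older motion fades. This is an illustrative assumption about the underlying MHI computation, not the paper's exact RGB-MHI pipeline; the `tau` and `threshold` parameters are hypothetical defaults.

```python
import numpy as np

def motion_history_image(frames, tau=30, threshold=32):
    """Compute a single-channel Motion History Image from grayscale frames.

    frames: sequence of 2D uint8 arrays (same shape).
    tau: value assigned to a pixel when motion is detected there.
    threshold: minimum absolute frame difference counted as motion.
    Returns a uint8 image where brighter pixels mean more recent motion.
    """
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    prev = frames[0].astype(np.int16)
    for frame in frames[1:]:
        cur = frame.astype(np.int16)
        motion = np.abs(cur - prev) >= threshold   # binary motion mask
        # moving pixels reset to tau; static pixels decay by 1 per frame
        mhi = np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))
        prev = cur
    # rescale to [0, 255] so the summary can be stored as an image
    return (255.0 * mhi / tau).astype(np.uint8)
```

An RGB-MHI variant would produce a three-channel summary image (e.g., by encoding motion history into color channels), which can then be fed to a 2D CNN or used to weight the spatial attention of a 3D-CNN.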