Sign language translation (SLT), which generates text in a spoken language from the visual content of a sign language, is important for supporting communication within the deaf and hard-of-hearing community. Inspired by neural machine translation (NMT), most existing SLT studies have adopted a general sequence-to-sequence learning strategy. However, SLT differs significantly from general NMT tasks because sign languages convey messages through multiple visual-manual channels. Therefore, in this paper, these unique characteristics of sign languages are formulated as hierarchical spatio-temporal graph representations, comprising high-level and fine-level graphs in which a vertex characterizes a specific body part and an edge represents the interaction between body parts. In particular, high-level graphs represent patterns in regions such as the hands and face, while fine-level graphs model the joints of the hands and the landmarks of facial regions. To learn these graph patterns, a novel deep learning architecture, namely the hierarchical spatio-temporal graph neural network (HST-GNN), is proposed. Graph convolutions and graph self-attention with neighborhood context are proposed to characterize both local and global graph properties. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed method.
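The two building blocks named above can be illustrated with a minimal sketch. This is not the authors' HST-GNN implementation: the toy fine-level hand graph (a palm vertex linked to four fingertip joints), the feature sizes, and all function names are assumptions for illustration. The sketch shows a symmetrically normalized graph convolution (local neighborhood patterns) and a self-attention whose scores are masked to each vertex's neighborhood context (global interactions constrained by graph structure).

```python
import numpy as np

# Illustrative sketch only (not the paper's code). All shapes, names,
# and the toy hand graph below are assumptions for demonstration.

rng = np.random.default_rng(0)

def normalized_adjacency(A):
    """Symmetrically normalize with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def graph_conv(H, A, W):
    """One graph-convolution layer: aggregate neighbors, project, ReLU."""
    return np.maximum(normalized_adjacency(A) @ H @ W, 0.0)

def graph_self_attention(H, A, Wq, Wk, Wv):
    """Self-attention restricted to each vertex's neighborhood context."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = (A + np.eye(A.shape[0])) > 0          # self + direct neighbors
    scores = np.where(mask, scores, -np.inf)     # block non-neighbors
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ V

# Hypothetical fine-level graph: palm (vertex 0) linked to 4 fingertips.
A = np.zeros((5, 5))
A[0, 1:] = A[1:, 0] = 1.0

H = rng.normal(size=(5, 8))                      # 8-d features per joint
W = rng.normal(size=(8, 4))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

local_feats = graph_conv(H, A, W)                # local graph properties
global_feats = graph_self_attention(H, A, Wq, Wk, Wv)
print(local_feats.shape, global_feats.shape)     # (5, 4) (5, 8)
```

In a full spatio-temporal model, layers like these would be applied per video frame and combined with temporal modeling across frames; the hierarchical design would further exchange information between the high-level region graph and the fine-level joint/landmark graphs.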