Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities. To achieve this goal, we propose the first method for animating text written in HamNoSys, a lexical Sign language notation, into signed pose sequences. As HamNoSys is universal, our proposed method offers a generic solution invariant to the target Sign language. Our method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. We use weak supervision for the training process and show that our method succeeds in learning from partial and inaccurate data. Additionally, we offer a new distance measure for pose sequences, normalized Dynamic Time Warping (nDTW), based on DTW over normalized keypoint trajectories, and validate its correctness using AUTSL, a large-scale Sign language dataset. We show that it measures the distance between pose sequences more accurately than existing measures and use it to assess the quality of our generated pose sequences. Code for the data pre-processing, the model, and the distance measure is publicly released for future research.