Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as its main evaluation metric. Since WER is not differentiable, the model is usually optimized instead with the connectionist temporal classification (CTC) loss, which maximizes the posterior probability over the sequential alignment. Due to this gap between the training objective and the evaluation metric, the predicted sentence with the highest decoding probability may not be the best choice under WER. To tackle this issue, we propose a novel architecture with cross-modality augmentation. Specifically, we first augment the cross-modal data by simulating the calculation procedure of WER, i.e., applying substitution, deletion, and insertion to both the text label and its corresponding video. With these real and generated pseudo video-text pairs, we propose multiple loss terms to minimize the cross-modality distance between the video and the ground-truth label, and to make the network distinguish real from pseudo modalities. The proposed framework can be easily extended to other existing CTC-based continuous SLR architectures. Extensive experiments on two continuous SLR benchmarks, i.e., RWTH-PHOENIX-Weather and CSL, validate the effectiveness of our method.
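To make the augmentation step concrete, below is a minimal Python sketch of how pseudo text labels could be generated by simulating the three WER edit operations on a gloss sequence. The function name `augment_label`, the edit probabilities, and sampling replacements uniformly from a gloss vocabulary are illustrative assumptions, not the authors' exact procedure; the paper also applies the corresponding edits to the aligned video segments, which is omitted here.

```python
import random

def augment_label(label, vocab, p_sub=0.1, p_del=0.1, p_ins=0.1):
    """Generate a pseudo label by simulating WER edit operations
    (substitution, deletion, insertion) on a gloss sequence.

    label: list of gloss tokens; vocab: list of glosses to sample from.
    """
    pseudo = []
    for gloss in label:
        r = random.random()
        if r < p_del:
            continue                             # deletion: drop this gloss
        elif r < p_del + p_sub:
            pseudo.append(random.choice(vocab))  # substitution: swap in a random gloss
        else:
            pseudo.append(gloss)                 # keep the original gloss
        if random.random() < p_ins:
            pseudo.append(random.choice(vocab))  # insertion: add a random gloss after
    return pseudo
```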
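For reference, the CTC objective the abstract refers to is the standard one over per-frame gloss posteriors; a minimal sketch using PyTorch's `torch.nn.CTCLoss` follows. All tensor shapes, the vocabulary size, and the choice of index 0 as the blank token are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Illustrative shapes: T video frames, N batch size, C gloss classes
# (including the CTC blank at index 0), S glosses per target sentence.
T, N, C, S = 120, 4, 1000, 12

# Per-frame log-posteriors, e.g. the output of a video encoder + softmax head.
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)

# Ground-truth gloss indices (index 0 is reserved for the blank symbol).
targets = torch.randint(1, C, (N, S))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC marginalizes over all frame-to-gloss alignments of the target.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```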