In this work, we present MoLang (a Motion-Language connecting model) for learning a joint representation of human motion and language, leveraging both paired and unpaired datasets of the motion and language modalities. To this end, we propose a motion-language model trained with contrastive learning, which enables our model to learn more generalizable representations of the human motion domain. Empirical results show that our model learns strong representations of human motion data by leveraging the language modality. Our method performs both action recognition and motion retrieval with a single model, outperforming state-of-the-art approaches on a number of action recognition benchmarks.
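The contrastive objective mentioned above can be sketched as a symmetric InfoNCE-style loss that pulls paired motion and text embeddings together while pushing unpaired ones apart. This is a minimal sketch under assumed details, not the paper's exact formulation: the function names, the temperature value, and the use of a simple softmax cross-entropy over cosine similarities are illustrative.

```python
import numpy as np

def contrastive_loss(motion_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    motion_emb, text_emb: arrays of shape (batch, dim), where row i of
    each array is assumed to come from the same motion-text pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity matrix: entry (i, j) compares motion i with text j.
    logits = m @ t.T / temperature
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Softmax cross-entropy where the matching pair (the diagonal)
        # is the correct class for each row.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the motion-to-text and text-to-motion directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Aligned pairs should yield a lower loss than mismatched ones, which is the property the joint representation learning relies on.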