We present IMU2CLIP, a novel pre-training approach that aligns Inertial Measurement Unit (IMU) motion sensor recordings with video and text by projecting them into the joint representation space of Contrastive Language-Image Pre-training (CLIP). The proposed approach allows IMU2CLIP to translate human motions (as measured by IMU sensors) into their corresponding textual descriptions and videos, while preserving the transitivity across these modalities. We explore several new IMU-based applications that IMU2CLIP enables, such as motion-based media retrieval and natural language reasoning tasks over motion data. In addition, we show that IMU2CLIP significantly improves downstream performance when fine-tuned for each application (e.g., activity recognition), demonstrating its broad utility as a new pre-trained resource. Our code will be made publicly available.
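To make the alignment objective concrete, the following is a minimal sketch of the kind of contrastive training the abstract describes: an IMU encoder is trained so that each IMU window's embedding lands near the (frozen) CLIP embedding of its time-aligned video clip or text description. The `IMUEncoder` architecture, window length, channel count, and temperature below are illustrative assumptions, not the paper's actual configuration; only the idea of projecting IMU features into CLIP's joint space with a symmetric contrastive loss follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMUEncoder(nn.Module):
    """Hypothetical IMU encoder: 1D convolutions over a 6-channel
    accelerometer+gyroscope stream, projected to CLIP's embedding size."""
    def __init__(self, in_channels: int = 6, clip_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over the time axis
        )
        self.proj = nn.Linear(128, clip_dim)

    def forward(self, imu: torch.Tensor) -> torch.Tensor:
        # imu: (batch, channels, time)
        h = self.conv(imu).squeeze(-1)   # (batch, 128)
        z = self.proj(h)                 # (batch, clip_dim)
        return F.normalize(z, dim=-1)    # unit-norm, as in CLIP

def clip_style_contrastive_loss(imu_emb, clip_emb, temperature=0.07):
    """Symmetric InfoNCE: each IMU window and its time-aligned CLIP
    (video or text) embedding form a positive pair; all other pairs
    in the batch serve as negatives."""
    logits = imu_emb @ clip_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: 8 five-second IMU windows at 200 Hz, paired with
# precomputed (frozen) CLIP embeddings of the aligned video clips.
encoder = IMUEncoder()
imu = torch.randn(8, 6, 1000)
clip_emb = F.normalize(torch.randn(8, 512), dim=-1)
loss = clip_style_contrastive_loss(encoder(imu), clip_emb)
loss.backward()
```

Because the CLIP image and text encoders stay frozen, anything trained into this joint space inherits CLIP's cross-modal structure, which is what enables the transitivity the abstract mentions (e.g., IMU-to-text retrieval via the shared space even if the model was aligned only against video).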