Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance on a given task compared to using only one of the modalities. Cross-modal representation learning from different data types -- such as images and time-series data (e.g., audio or text) -- requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the triplet loss, which selects, relative to an anchor, positive samples (same label) and negative samples (different label), for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss to the cross-modal setting, the main (time-series classification) task achieves higher accuracy by exploiting additional information from the auxiliary (image classification) task. Our experiments on synthetic data and on handwriting recognition data from sensor-enhanced pens show improved classification accuracy, faster convergence, and better generalizability.
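The triplet loss adapts naturally to the cross-modal setting by drawing the anchor from one modality and the positive and negative samples from the other. Below is a minimal PyTorch sketch, assuming hypothetical pre-computed embeddings from an image encoder and a time-series encoder; the function name, margin value, and batch layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a cross-modal triplet loss, assuming two hypothetical
# encoders have already mapped images and time series into a shared space.
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(anchor_img: torch.Tensor,
                             positive_ts: torch.Tensor,
                             negative_ts: torch.Tensor,
                             margin: float = 1.0) -> torch.Tensor:
    """Image embeddings serve as anchors; time-series embeddings with the
    same label are positives, those with a different label are negatives.
    Pushes d(anchor, negative) above d(anchor, positive) by >= `margin`."""
    d_pos = F.pairwise_distance(anchor_img, positive_ts)  # shape: (batch,)
    d_neg = F.pairwise_distance(anchor_img, negative_ts)  # shape: (batch,)
    return F.relu(d_pos - d_neg + margin).mean()

# Usage with random embeddings standing in for encoder outputs:
emb_dim = 64
a = torch.randn(32, emb_dim)  # image-encoder outputs (anchors)
p = torch.randn(32, emb_dim)  # time-series embeddings, same labels
n = torch.randn(32, emb_dim)  # time-series embeddings, different labels
loss = cross_modal_triplet_loss(a, p, n)
```

Minimizing this loss pulls same-label pairs from the two modalities together in the shared space while pushing differently labeled pairs apart, which is what lets the time-series classifier benefit from the image modality.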