This study achieved bidirectional translation between descriptions and actions using a small amount of paired data across modalities. The ability to mutually generate descriptions and actions is essential for robots to collaborate with humans in daily life, but it generally requires a large dataset containing comprehensive pairs of data from both modalities. However, such a paired dataset is expensive to construct and difficult to collect. To address this issue, this study proposes a two-stage training method for bidirectional translation. In the proposed method, we first train recurrent autoencoders (RAEs) for descriptions and actions with a large amount of non-paired data. We then fine-tune the entire model to bind their intermediate representations using a small amount of paired data. Because the data used for pre-training need not be paired, behavior-only data or a large language corpus can be used. We experimentally evaluated our method on a paired dataset consisting of motion-captured actions and descriptions. The results showed that our method performed well even when the amount of paired training data was small. Visualization of the intermediate representations of each RAE showed that similar actions were encoded into clustered positions and that the corresponding feature vectors were well aligned.
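The following is a minimal sketch of the two-stage procedure described above, written in PyTorch. It assumes GRU-based recurrent autoencoders, an MSE binding loss between the two intermediate representations, continuous feature inputs, and arbitrary dimensions; the paper's actual architecture, losses, and hyperparameters may differ.

```python
# Sketch only: two RAEs pre-trained separately on non-paired data, then
# fine-tuned jointly so that paired descriptions and actions share a
# latent representation. Architectural details are assumptions.
import torch
import torch.nn as nn

class RecurrentAutoencoder(nn.Module):
    """Sequence autoencoder with a fixed-size intermediate (latent) vector."""
    def __init__(self, feat_dim, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_latent = nn.Linear(hidden_dim, latent_dim)
        self.from_latent = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def encode(self, x):
        _, h = self.encoder(x)          # h: (1, batch, hidden_dim)
        return self.to_latent(h[-1])    # intermediate representation

    def decode(self, z, seq_len):
        h = self.from_latent(z).unsqueeze(0)                     # init hidden
        inp = torch.zeros(z.size(0), seq_len, self.out.out_features)
        y, _ = self.decoder(inp, h)
        return self.out(y)                                       # sequence

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z, x.size(1)), z

# Dummy feature sizes (assumptions): 64-dim word features, 30-dim joint angles.
desc_rae = RecurrentAutoencoder(feat_dim=64, hidden_dim=128, latent_dim=32)
act_rae = RecurrentAutoencoder(feat_dim=30, hidden_dim=128, latent_dim=32)
recon = nn.MSELoss()
opt = torch.optim.Adam(
    list(desc_rae.parameters()) + list(act_rae.parameters()), lr=1e-3)

# Stage 1: pre-train each RAE on its own non-paired data (reconstruction only).
def pretrain_step(rae, batch):
    opt.zero_grad()
    x_hat, _ = rae(batch)
    loss = recon(x_hat, batch)
    loss.backward()
    opt.step()
    return loss.item()

# Stage 2: fine-tune both RAEs on the small paired set, adding a binding
# term that pulls the two intermediate representations together.
def finetune_step(desc_batch, act_batch, bind_weight=1.0):
    opt.zero_grad()
    d_hat, z_d = desc_rae(desc_batch)
    a_hat, z_a = act_rae(act_batch)
    loss = (recon(d_hat, desc_batch) + recon(a_hat, act_batch)
            + bind_weight * recon(z_d, z_a))
    loss.backward()
    opt.step()
    return loss.item()

# Translation at test time: encode with one RAE, decode with the other.
with torch.no_grad():
    action = torch.randn(1, 50, 30)          # dummy motion-capture sequence
    z = act_rae.encode(action)
    description = desc_rae.decode(z, seq_len=20)
```

In this sketch the binding term is a simple MSE between the two latent vectors of a paired sample, so that after fine-tuning a latent vector encoded from one modality can be decoded by the other modality's decoder; other alignment losses could be substituted.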