Predicting future sensory states is crucial for learning agents such as robots, drones, and autonomous vehicles. In this paper, we couple multiple sensory modalities with exploratory actions and propose a predictive neural network architecture to address this problem. Most existing approaches rely on large, manually annotated datasets or use visual data as the only modality. In contrast, the unsupervised method presented here uses multi-modal perceptions to predict future visual frames. As a result, the proposed model is more comprehensive and better captures the spatio-temporal dynamics of the environment, leading to more accurate visual frame prediction. A further novelty of our framework is the use of sub-networks dedicated to anticipating future haptic, audio, and tactile signals. The framework was tested and validated with a dataset containing 4 sensory modalities (vision, haptic, audio, and tactile) collected by a humanoid robot performing 9 behaviors multiple times on a large set of objects. While visual information is the dominant modality, utilizing the additional non-visual modalities improves prediction accuracy.
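For concreteness, the sketch below illustrates one way such a multi-modal predictive architecture could be organized: modality-specific encoders for vision, haptic, audio, and tactile inputs, a shared recurrent core that fuses them over time, a decoder that predicts the next visual frame, and small sub-network heads that anticipate the next non-visual signals. This is a minimal illustrative sketch in PyTorch, not the architecture evaluated in the paper; all layer sizes, input dimensions (64x64 frames; 48-, 128-, and 64-dimensional haptic, audio, and tactile vectors), and names such as `MultimodalPredictor` are assumptions made for the example.

```python
# Hypothetical sketch of a multi-modal predictive network: a shared recurrent
# core fuses visual, haptic, audio, and tactile features, and modality-specific
# heads predict the next visual frame and the next non-visual signals.
import torch
import torch.nn as nn

class MultimodalPredictor(nn.Module):
    def __init__(self, haptic_dim=48, audio_dim=128, tactile_dim=64, hidden_dim=256):
        super().__init__()
        # Visual encoder: 64x64 RGB frame -> feature vector
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, hidden_dim),
        )
        # Non-visual encoders project each modality to the shared hidden size
        self.haptic_enc = nn.Linear(haptic_dim, hidden_dim)
        self.audio_enc = nn.Linear(audio_dim, hidden_dim)
        self.tactile_enc = nn.Linear(tactile_dim, hidden_dim)
        # Recurrent core captures spatio-temporal dynamics of the fused features
        self.core = nn.LSTM(4 * hidden_dim, hidden_dim, batch_first=True)
        # Decoder reconstructs the predicted next visual frame
        self.vision_dec = nn.Sequential(
            nn.Linear(hidden_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        # Sub-network heads anticipating the next non-visual signals
        self.haptic_head = nn.Linear(hidden_dim, haptic_dim)
        self.audio_head = nn.Linear(hidden_dim, audio_dim)
        self.tactile_head = nn.Linear(hidden_dim, tactile_dim)

    def forward(self, frames, haptic, audio, tactile):
        # frames: (B, T, 3, 64, 64); other inputs: (B, T, dim)
        B, T = frames.shape[:2]
        v = self.vision_enc(frames.reshape(B * T, 3, 64, 64)).reshape(B, T, -1)
        fused = torch.cat(
            [v, self.haptic_enc(haptic), self.audio_enc(audio), self.tactile_enc(tactile)],
            dim=-1,
        )
        h, _ = self.core(fused)
        last = h[:, -1]  # final hidden state summarizes the observed sequence
        next_frame = self.vision_dec(last)
        return (next_frame, self.haptic_head(last),
                self.audio_head(last), self.tactile_head(last))


# Example: a batch of 8 sequences, 5 time steps each
model = MultimodalPredictor()
frames = torch.rand(8, 5, 3, 64, 64)
haptic, audio, tactile = torch.rand(8, 5, 48), torch.rand(8, 5, 128), torch.rand(8, 5, 64)
pred_frame, pred_h, pred_a, pred_t = model(frames, haptic, audio, tactile)
print(pred_frame.shape)  # torch.Size([8, 3, 64, 64])
```

In a setup like this, training would typically minimize a sum of per-modality reconstruction losses (e.g., mean squared error between predicted and observed next-step signals), so the non-visual prediction heads act as auxiliary objectives alongside the primary visual frame prediction.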