In this paper, we address the problem of forecasting the trajectory of an egocentric camera wearer (ego-person) in crowded spaces. The trajectory forecasting ability learned from data of different camera wearers walking in the real world can be transferred to assist visually impaired people in navigation, and to instill human navigation behaviours in mobile robots, enabling better human-robot interaction. To this end, we construct a novel egocentric human trajectory forecasting dataset containing real trajectories of people navigating crowded spaces while wearing a camera, together with rich contextual data extracted from the recordings. We extract and utilize three modalities to forecast the trajectory of the camera wearer: his/her past trajectory, the past trajectories of nearby people, and the environment, such as scene semantics and scene depth. We design a Transformer-based encoder-decoder neural network model, integrated with a novel cascaded cross-attention mechanism that fuses the multiple modalities, to predict the future trajectory of the camera wearer. Extensive experiments show that our model outperforms state-of-the-art methods in egocentric human trajectory forecasting.
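The cascaded cross-attention idea can be illustrated with a minimal sketch: features of the wearer's past trajectory first cross-attend to the nearby-people features, and the fused result then cross-attends to the environment features. The module below is a hypothetical PyTorch illustration of this cascade, not the paper's actual architecture; all names, dimensions, and the residual/normalization layout are assumptions.

```python
# Hypothetical sketch of cascaded cross-attention fusing three modalities.
# All module names, dimensions, and the residual/LayerNorm layout are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class CascadedCrossAttention(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Stage 1: ego-trajectory tokens attend to nearby-people tokens.
        self.attn_people = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stage 2: the fused tokens attend to environment tokens
        # (e.g. scene-semantics or depth features).
        self.attn_env = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, ego, people, env):
        # ego: (B, T, d)  past-trajectory tokens of the camera wearer
        # people: (B, N, d)  tokens for nearby people's past trajectories
        # env: (B, M, d)  environment tokens
        x, _ = self.attn_people(ego, people, people)
        x = self.norm1(ego + x)            # residual after first cross-attention
        y, _ = self.attn_env(x, env, env)
        return self.norm2(x + y)           # residual after second cross-attention


fusion = CascadedCrossAttention()
ego = torch.randn(2, 8, 64)     # 8 past-trajectory tokens per sequence
people = torch.randn(2, 5, 64)  # 5 nearby-people tokens
env = torch.randn(2, 16, 64)    # 16 environment tokens
out = fusion(ego, people, env)
print(out.shape)                # same shape as the ego query: (2, 8, 64)
```

The cascade keeps the ego trajectory as the query throughout, so the output stays aligned with the wearer's timeline and can feed a Transformer decoder that predicts future positions.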