In this work we propose a deep learning pipeline to predict the future visual appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, we follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, i.e., generating a synthetic representation of an object undergoing a geometric roto-translation in 3D space. Our model can easily be conditioned on constraints (e.g., input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse, realistic futures from the same input in a multi-modal fashion. We show visually and quantitatively that this approach outperforms traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset.
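To make the per-object stage concrete, the following is a minimal sketch, in a PyTorch style, of a generator conditioned on a target roto-translation: it takes a cropped object and a pose vector and synthesises the object's appearance at the future pose. All names, dimensions, and the pose encoding are hypothetical; the paper's actual architecture, losses, and scene compositing step are not reproduced here.

    # Minimal sketch (hypothetical names and dimensions) of a per-object,
    # pose-conditioned generator. Assumes a standard PyTorch environment.
    import torch
    import torch.nn as nn

    class NovelViewGenerator(nn.Module):
        def __init__(self, pose_dim=6, feat=64):
            super().__init__()
            # Encode the current object crop into a spatial feature map.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, feat, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat * 2, 4, stride=2, padding=1), nn.ReLU(),
            )
            # Embed the target roto-translation (e.g. 3 rotation + 3 translation parameters).
            self.pose_embed = nn.Linear(pose_dim, feat * 2)
            # Decode back to an RGB crop of the object at the future pose.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(feat * 2, feat, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(feat, 3, 4, stride=2, padding=1), nn.Tanh(),
            )

        def forward(self, crop, pose):
            h = self.encoder(crop)                      # (B, 2*feat, H/4, W/4)
            p = self.pose_embed(pose)[..., None, None]  # broadcast the pose over space
            return self.decoder(h + p)                  # object appearance at the target pose

    # One forward pass per tracked object; the conditioning pose comes from a
    # tracker's predicted trajectory or from the user, so different poses yield
    # different plausible futures for the same input crop.
    crop = torch.randn(1, 3, 64, 64)   # cropped vehicle (hypothetical resolution)
    pose = torch.randn(1, 6)           # target roto-translation (hypothetical encoding)
    future_crop = NovelViewGenerator()(crop, pose)
    print(future_crop.shape)           # torch.Size([1, 3, 64, 64])

In this sketch, conditioning is injected by adding the pose embedding to the encoder features before decoding; this is one common design choice and is not claimed to match the authors' conditioning mechanism.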