Novel view synthesis is required in many robotic applications, such as VR teleoperation and scene reconstruction. Existing methods are often too slow for these contexts, cannot handle dynamic scenes, and are limited by their explicit depth estimation stage, where incorrect depth predictions can lead to large projection errors. Our proposed method runs in real time on live streaming data and avoids explicit depth estimation by efficiently warping the input images into the target frame for a range of assumed depth planes. The resulting plane sweep volume (PSV) is fed directly into our network, which first estimates soft PSV masks in a self-supervised manner and then directly produces the novel output view. This improves both efficiency and performance on transparent, reflective, thin, and featureless scene parts. FaDIV-Syn performs both interpolation and extrapolation at 540p in real time and outperforms state-of-the-art extrapolation methods on the large-scale RealEstate10k dataset. We thoroughly evaluate ablations, such as removing the Soft-Masking network, training from fewer examples, and generalization to higher resolutions and stronger depth discretization. Our implementation is available.
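To make the plane-sweep idea concrete, the following is a minimal sketch of how a PSV can be built by warping a source image into the target view once per assumed fronto-parallel depth plane. It assumes a pinhole camera model and PyTorch; all names (`build_psv`, `K_src`, `K_tgt`, etc.) are hypothetical and do not reflect the actual FaDIV-Syn implementation.

```python
# Illustrative PSV construction via plane-induced homography warping.
# Hypothetical sketch, not the authors' code.
import torch
import torch.nn.functional as F

def build_psv(src_img, K_src, K_tgt, R, t, depths):
    """Warp src_img into the target view once per assumed depth plane.

    src_img: (1, 3, H, W) source image
    K_src, K_tgt: (3, 3) camera intrinsics
    R, t: rotation (3, 3) and translation (3,) from target to source frame
    depths: iterable of assumed plane depths in the target frame
    Returns a PSV of shape (1, 3 * len(depths), H, W).
    """
    _, _, H, W = src_img.shape
    # Homogeneous pixel grid of the target view, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
    rays = K_tgt.inverse() @ pix  # back-projected rays at depth 1
    planes = []
    for d in depths:
        # Lift target pixels onto the plane at depth d, then project
        # them into the source camera.
        pts = R @ (rays * d) + t.unsqueeze(1)     # (3, H*W) in source frame
        uv = K_src @ pts
        uv = uv[:2] / uv[2:].clamp(min=1e-6)      # perspective divide
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([uv[0] / (W - 1) * 2 - 1,
                            uv[1] / (H - 1) * 2 - 1],
                           dim=-1).reshape(1, H, W, 2)
        planes.append(F.grid_sample(src_img, grid, align_corners=True))
    return torch.cat(planes, dim=1)
```

In practice, plane-sweep methods commonly space the candidate planes uniformly in inverse depth rather than in depth, so that nearby geometry, where parallax is largest, receives a finer discretization.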