Modern mobile burst photography pipelines capture and merge a short sequence of frames to recover an enhanced image, but often disregard the 3D nature of the scene they capture, treating pixel motion between images as a 2D aggregation problem. We show that in a "long-burst", forty-two 12-megapixel RAW frames captured in a two-second sequence, there is enough parallax information from natural hand tremor alone to recover high-quality scene depth. To this end, we devise a test-time optimization approach that fits a neural RGB-D representation to long-burst data and simultaneously estimates scene depth and camera motion. Our plane plus depth model is trained end-to-end, and performs coarse-to-fine refinement by controlling which multi-resolution volume features the network has access to at what time during training. We validate the method experimentally, and demonstrate geometrically accurate depth reconstructions with no additional hardware or separate data pre-processing and pose-estimation steps.
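To make the test-time optimization idea concrete, the sketch below (not the authors' code) shows the basic joint fitting loop: a small network predicts per-pixel depth for a reference frame, per-frame rigid motions model hand tremor, and both are optimized together by minimizing photometric reprojection error against the other burst frames. The plain MLP stands in for the paper's plane plus depth model with multi-resolution volume features; the network sizes, `fit_long_burst` interface, and pinhole-projection details are illustrative assumptions.

```python
# Minimal sketch of long-burst test-time optimization: jointly fit a depth
# network and per-frame camera motion by photometric reprojection loss.
# All names and hyperparameters here are hypothetical, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthMLP(nn.Module):
    """Maps normalized pixel coordinates (x, y) in [-1, 1] to a positive depth."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keep depth > 0
        )

    def forward(self, xy):
        return self.net(xy)

def reproject(xy, depth, pose, intrinsics):
    """Lift reference pixels to 3D, apply a small rigid motion, project back.
    xy: (N, 2) normalized coords; depth: (N, 1); pose: (6,) rotation + translation.
    intrinsics = (fx, fy, cx, cy), expressed in the same normalized units."""
    fx, fy, cx, cy = intrinsics
    # Back-project to camera-space 3D points.
    X = (xy[:, 0:1] - cx) / fx * depth
    Y = (xy[:, 1:2] - cy) / fy * depth
    pts = torch.cat([X, Y, depth], dim=-1)
    # Small-angle rotation approximation (hand tremor is tiny) plus translation.
    w, t = pose[:3], pose[3:]
    pts = pts + torch.cross(w.expand_as(pts), pts, dim=-1) + t
    # Project back into the target frame's normalized pixel coordinates.
    u = fx * pts[:, 0:1] / pts[:, 2:3] + cx
    v = fy * pts[:, 1:2] / pts[:, 2:3] + cy
    return torch.cat([u, v], dim=-1)

def sample_colors(frame, uv):
    """Bilinearly sample RGB at (u, v) in [-1, 1] from a (3, H, W) frame."""
    grid = uv.view(1, -1, 1, 2)
    out = F.grid_sample(frame.unsqueeze(0), grid, align_corners=True)
    return out.view(3, -1).t()

def fit_long_burst(frames, intrinsics, iters=2000, lr=1e-3):
    """frames: (T, 3, H, W) long-burst stack; frame 0 is the reference."""
    depth_net = DepthMLP()
    poses = nn.Parameter(torch.zeros(frames.shape[0], 6))  # one pose per frame
    opt = torch.optim.Adam(list(depth_net.parameters()) + [poses], lr=lr)
    for _ in range(iters):
        # Random subset of reference pixels each step.
        xy = torch.rand(4096, 2) * 2 - 1
        depth = depth_net(xy)
        ref_rgb = sample_colors(frames[0], xy)
        loss = 0.0
        for t in range(1, frames.shape[0]):
            uv = reproject(xy, depth, poses[t], intrinsics)
            loss = loss + F.l1_loss(sample_colors(frames[t], uv), ref_rgb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return depth_net, poses
```

The paper's coarse-to-fine refinement would additionally gate which resolution levels of the volume features the depth model can use as training progresses; that scheduling is omitted here for brevity.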