This paper deals with the scarcity of data for training optical flow networks, highlighting the limitations of existing sources such as labeled synthetic datasets or unlabeled real videos. Specifically, we introduce a framework to generate accurate ground-truth optical flow annotations quickly and in large amounts from any readily available single real picture. Given an image, we use an off-the-shelf monocular depth estimation network to build a plausible point cloud for the observed scene. Then, we virtually move the camera in the reconstructed environment with known motion vectors and rotation angles, allowing us to synthesize both a novel view and the corresponding optical flow field connecting each pixel in the input image to the one in the new frame. When trained with our data, state-of-the-art optical flow networks achieve superior generalization to unseen real data compared to the same models trained either on annotated synthetic datasets or unlabeled videos, and better specialization if combined with synthetic images.
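To make the described pipeline concrete, below is a minimal sketch (not the authors' implementation) of its geometric core: back-project each pixel of the input image to 3D using its estimated depth and the camera intrinsics, move the point by a known virtual camera motion, re-project it into the novel view, and read off the optical flow as the pixel displacement. The function name `synthesize_flow` and the parameters `K` (intrinsics), `R` (rotation), and `t` (translation) are illustrative assumptions; a complete pipeline would additionally need to forward-warp pixels to render the novel view and handle occlusions, collisions, and holes.

```python
import numpy as np

def synthesize_flow(depth, K, R, t):
    """Sketch: compute the optical flow induced by a virtual camera motion
    (rotation R, translation t) given a per-pixel depth map and intrinsics K.
    Returns an (h, w, 2) flow field mapping input pixels to the novel view.
    """
    h, w = depth.shape
    # Pixel grid in homogeneous coordinates (u, v, 1), flattened row-major.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N

    # Back-project each pixel to a 3D point using its estimated depth.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)  # 3 x N

    # Apply the known virtual camera motion.
    pts_new = R @ pts + t.reshape(3, 1)

    # Re-project the moved points into the novel view.
    proj = K @ pts_new
    proj = proj[:2] / proj[2:3]

    # Flow = displacement between corresponding pixel coordinates.
    return (proj - pix[:2]).T.reshape(h, w, 2)

# Toy usage: identity rotation, small lateral translation, flat scene.
h, w = 480, 640
depth = np.full((h, w), 5.0)  # every pixel 5 m away (toy depth map)
K = np.array([[500., 0., w / 2],
              [0., 500., h / 2],
              [0., 0., 1.]])
flow = synthesize_flow(depth, K, np.eye(3), np.array([0.1, 0., 0.]))
```

Because the flow follows analytically from the geometry, no correspondence search is needed: the ground truth is exact by construction, which is what makes single images a viable source of supervision under this scheme.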