Stereo matching is a fundamental problem in computer vision. Despite recent progress driven by deep learning, improving robustness remains a necessity when deploying stereo-matching models in real-world applications. Unlike the common practice of developing an elaborate model to achieve robustness, we argue that training on a collection of multiple available datasets is a cheaper way to improve generalization. Specifically, this report presents an improved RaftStereo trained on a mixture of seven public datasets for the Robust Vision Challenge (denoted iRaftStereo_RVC). When evaluated on the training sets of Middlebury, KITTI-2015, and ETH3D, the model outperforms counterparts trained on a single dataset, such as the popular SceneFlow. After the pre-trained model is fine-tuned on the three challenge datasets, it ranks 2nd on the stereo leaderboard, demonstrating the benefits of mixed-dataset pre-training.