Nowadays, many captured images are "observed" only by machines and not by humans, for example by the cameras of robots or autonomous cars. High-level machine vision models, such as object recognition or semantic segmentation, assume that images are transformed into some canonical image space by the camera ISP. However, the camera ISP is optimized for producing images that are visually pleasing to human observers, not to machines; thus, one may spare the ISP compute time and apply the vision models directly to the raw data. Yet, it has been shown that training such models directly on RAW images results in a performance drop. To mitigate this drop (without the need to annotate RAW data), we use a dataset of RAW and RGB image pairs, which can be easily acquired with no human labeling. We then train a model that is applied directly to the RAW data, using knowledge distillation so that the model's predictions for RAW images are aligned with the predictions of an off-the-shelf pre-trained model for the corresponding processed RGB images. Our experiments show that our performance on RAW images for object classification and semantic segmentation is significantly better than that of a model trained on labeled RAW images. It also reasonably matches the predictions of a pre-trained model on processed RGB images, while saving the ISP compute overhead.
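The distillation objective described above can be illustrated as a temperature-scaled KL divergence between the teacher's soft predictions on the processed RGB image and the student's predictions on the paired RAW image. The sketch below is a minimal, self-contained illustration of this idea, not the paper's exact loss; the function names and the temperature value are our own choices:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits_raw, teacher_logits_rgb, temperature=2.0):
    """KL divergence between the teacher's soft predictions on a processed
    RGB image and the student's predictions on the paired RAW image.
    Only RAW/RGB pairs are needed; no human labels are involved."""
    p = softmax(teacher_logits_rgb, temperature)  # teacher soft targets
    q = softmax(student_logits_raw, temperature)  # student predictions
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
```

Minimizing this loss over RAW/RGB pairs pushes the RAW-input student toward the RGB-input teacher, which is what lets the student skip the ISP at inference time. Identical logits give zero loss, and the loss grows as the two prediction distributions diverge.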