Over recent years, deep learning-based computer vision systems have been applied to images at an ever-increasing pace, oftentimes representing the only type of consumption for those images. Given the dramatic explosion in the number of images generated per day, a question arises: how much better would an image codec targeting machine consumption perform compared to state-of-the-art codecs targeting human consumption? In this paper, we propose an image codec for machines that is neural network (NN) based and end-to-end learned. In particular, we propose a set of training strategies that address the delicate problem of balancing competing loss functions, such as computer vision task losses, image distortion losses, and the rate loss. Our experimental results show that our NN-based codec outperforms the state-of-the-art Versatile Video Coding (VVC) standard on the object detection and instance segmentation tasks, achieving BD-rate gains of -37.87% and -32.90%, respectively, while being fast thanks to its compact size. To the best of our knowledge, this is the first end-to-end learned machine-targeted image codec.