While humans can effortlessly transform complex visual scenes into simple words, and vice versa, by leveraging their high-level understanding of the content, neither conventional nor more recent learned image compression codecs seem to utilize the semantic meaning of visual content to its full potential. Moreover, they focus mostly on rate-distortion performance, tend to underperform in perceptual quality, especially in the low-bitrate regime, and often disregard the performance of downstream computer vision algorithms, which are a fast-growing group of consumers of compressed images in addition to human viewers. In this paper, we (1) present a generic framework that enables any image codec to leverage high-level semantics and (2) study the joint optimization of perceptual quality and distortion. Our idea is that, given any codec, we use high-level semantics to augment the low-level visual features it extracts, producing what is essentially a new, semantic-aware codec. We propose a three-phase training scheme that teaches semantic-aware codecs to leverage the power of semantics to jointly optimize rate-perception-distortion (R-PD) performance. As an additional benefit, semantic-aware codecs also boost the performance of downstream computer vision algorithms. To validate our claims, we perform extensive empirical evaluations and provide both quantitative and qualitative results.
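The two ideas above, augmenting a codec's low-level features with semantic ones and trading rate off against distortion and perceptual quality, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract specifies neither the fusion operator nor the form of the objective, so the concatenation in `augment`, the weighted sum in `rpd_loss`, and the weights in `RPDWeights` are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RPDWeights:
    # Hypothetical trade-off weights; the values are not taken from the paper.
    distortion: float = 1.0
    perception: float = 0.1


def augment(low_level: Sequence[float], semantic: Sequence[float]) -> list:
    """Fuse features by simple concatenation (an assumed fusion operator)."""
    return list(low_level) + list(semantic)


def rpd_loss(rate_bpp: float, distortion: float, perception: float,
             weights: RPDWeights = RPDWeights()) -> float:
    """Assumed objective: rate plus weighted distortion and perceptual terms."""
    return (rate_bpp
            + weights.distortion * distortion
            + weights.perception * perception)


if __name__ == "__main__":
    fused = augment([0.2, 0.7], [1.0])   # codec features + semantic features
    print(len(fused))                    # number of fused feature entries
    print(rpd_loss(0.5, 2.0, 10.0))      # scalar loss for one image
```

Raising `weights.perception` relative to `weights.distortion` would push such an objective toward perceptual quality at the cost of pixel-level fidelity, which is the trade-off the abstract refers to as joint optimization.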