We propose an end-to-end image compression and analysis model built on Transformers, targeting cloud-based image classification. Instead of placing an existing Transformer-based image classification model directly after an image codec, we redesign the Vision Transformer (ViT) so that it classifies images directly from the compressed features and, in turn, facilitates image compression with the long-term information captured by the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by this encoder carry a convolutional inductive bias and are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module that fuses the compressed features with selected intermediate features of the Transformer and feeds the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features obtain long-term information from the self-attention mechanism of the Transformer, which improves compression performance. Finally, the rate-distortion-accuracy optimization problem is solved with a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and classification tasks.
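To make the rate-distortion-accuracy trade-off concrete, the following is a minimal sketch of one way such a joint objective could be written. It is an assumption for illustration, not the formulation used in this work: the weighted-sum form, the choice of mean squared error as the distortion, the cross-entropy accuracy term, and the names `rate_bpp`, `rda_loss`, `lambda_d`, and `lambda_c` are all hypothetical. The two-step training strategy itself is not modelled here.

```python
import math

def rate_bpp(likelihoods, num_pixels):
    """Estimated rate in bits per pixel, given the probabilities an
    entropy model assigns to the coded symbols (assumed inputs)."""
    return sum(-math.log2(p) for p in likelihoods) / num_pixels

def mse(original, reconstructed):
    """Distortion as mean squared error (an assumed distortion metric)."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def cross_entropy(logits, label):
    """Classification loss: negative log-softmax of the true class,
    computed with the max-shift trick for numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def rda_loss(likelihoods, num_pixels, original, reconstructed,
             logits, label, lambda_d=1.0, lambda_c=1.0):
    """Hypothetical joint objective: R + lambda_d * D + lambda_c * CE.
    The weights lambda_d and lambda_c are illustrative assumptions."""
    rate = rate_bpp(likelihoods, num_pixels)
    distortion = mse(original, reconstructed)
    accuracy_term = cross_entropy(logits, label)
    return rate + lambda_d * distortion + lambda_c * accuracy_term
```

In this sketch, raising `lambda_d` favours reconstruction quality and raising `lambda_c` favours classification accuracy, each at the cost of a higher bit rate. How the actual model balances and schedules these terms is not specified in the abstract.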