A Transformer-based Image Compression (TIC) approach is developed that reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders. Both the main and hyper encoders comprise a sequence of neural transformation units (NTUs) that analyse and aggregate important information for a more compact representation of the input image, while the decoders mirror the encoder-side operations to reconstruct the pixel-domain image from the compressed bitstream. Each NTU consists of a Swin Transformer Block (STB) and a convolutional layer (Conv) to best embed both long-range and short-range information; meanwhile, a causal attention module (CAM) is devised for adaptive context modeling of latent features, utilizing both hyper and autoregressive priors. TIC rivals state-of-the-art approaches, including deep convolutional neural network (CNN) based learnt image coding (LIC) methods and the handcrafted, rules-based intra profile of the recently approved Versatile Video Coding (VVC) standard, while requiring far fewer model parameters, e.g., up to a 45% reduction relative to the leading-performance LIC.
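To illustrate the NTU idea described above (pairing a long-range attention stage with a short-range convolutional stage that also downsamples), the following is a minimal NumPy sketch. It is an assumption-laden toy, not the paper's actual STB: it uses plain global single-head attention without windowing or learned projections, and a bare 2x2 stride-2 convolution.

```python
import numpy as np

def conv2d_stride2(x, w):
    """Toy 2x2 stride-2 convolution (spatial downsampling), NCHW layout."""
    n, c, h, wd = x.shape
    co = w.shape[0]
    out = np.zeros((n, co, h // 2, wd // 2))
    for i in range(h // 2):
        for j in range(wd // 2):
            patch = x[:, :, 2 * i:2 * i + 2, 2 * j:2 * j + 2].reshape(n, -1)
            out[:, :, i, j] = patch @ w.reshape(co, -1).T
    return out

def self_attention(tokens):
    """Global single-head self-attention over spatial tokens (no learned Q/K/V)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)
    return p @ tokens

def ntu(x, w):
    """Hypothetical NTU sketch: attention captures long-range dependencies,
    then a strided conv embeds short-range structure and downsamples."""
    n, c, h, wd = x.shape
    tokens = x.reshape(n, c, h * wd)[0].T            # (h*w, c) spatial tokens
    attended = self_attention(tokens).T.reshape(1, c, h, wd)
    return conv2d_stride2(attended, w)
```

Stacking several such units halves the spatial resolution at each stage, which is how the encoder progressively aggregates the input image into a compact latent representation.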