Self-supervised learning has attracted increasing attention as it learns data-driven representations without annotations. The vision transformer-based autoencoder (ViT-AE) by He et al. (2021) is a recent self-supervised learning technique that employs a patch-masking strategy to learn a meaningful latent space. In this paper, we focus on improving ViT-AE (nicknamed ViT-AE++) toward more effective representations of both 2D and 3D medical images. We propose two new loss functions to enhance the learned representation during training. The first loss term improves self-reconstruction by accounting for structured dependencies, thereby indirectly improving the representation. The second loss term leverages a contrastive loss to directly optimize the representation from two randomly masked views. As an independent contribution, we extend ViT-AE++ to 3D for volumetric medical images. We extensively evaluate ViT-AE++ on both natural and medical images, demonstrating consistent improvements over the vanilla ViT-AE and superiority over other contrastive learning approaches.
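A minimal sketch of how the two proposed loss terms could be combined in practice, assuming a masked-autoencoder-style training loop. This is illustrative only: the plain MSE standing in for the structured-dependency reconstruction term, the InfoNCE formulation, the temperature, and the weighting factor beta are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(z1, z2, temperature=0.1):
    # InfoNCE between pooled latent vectors of two randomly masked views, shape (B, D).
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric cross-entropy: matching views are the positives on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def vitae_pp_loss(recon1, recon2, target, z1, z2, beta=0.5):
    # Reconstruction term for both masked views (MSE used here as a placeholder for
    # the structured-dependency loss described in the abstract), plus the contrastive
    # term that directly aligns the two views' representations.
    l_recon = F.mse_loss(recon1, target) + F.mse_loss(recon2, target)
    return l_recon + beta * contrastive_loss(z1, z2)


# Toy usage with random tensors standing in for the encoder/decoder outputs.
B, C, H, W, D = 4, 1, 64, 64, 256
target = torch.rand(B, C, H, W)
loss = vitae_pp_loss(torch.rand(B, C, H, W), torch.rand(B, C, H, W),
                     target, torch.randn(B, D), torch.randn(B, D))
```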