The Masked Autoencoder (MAE) has recently been shown to be effective in pre-training Vision Transformers (ViTs) for natural image analysis. By reconstructing full images from partially masked inputs, the ViT encoder learns to aggregate contextual information to infer masked image regions. We believe this context aggregation ability is particularly essential in the medical imaging domain, where each anatomical structure is functionally and mechanically connected to other structures and regions. Because no ImageNet-scale medical image dataset is available for pre-training, we investigate a self pre-training paradigm with MAE for medical image analysis tasks: our method pre-trains a ViT on the training set of the target data rather than on a separate dataset. Self pre-training can therefore benefit scenarios where pre-training data is hard to acquire. Our experimental results show that MAE self pre-training markedly improves diverse medical image tasks, including chest X-ray disease classification, abdominal CT multi-organ segmentation, and MRI brain tumor segmentation. Code is available at https://github.com/cvlab-stonybrook/SelfMedMAE.
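For concreteness, below is a minimal sketch of the random patch masking at the heart of MAE pre-training, written in PyTorch. It follows the publicly known MAE recipe; the function name, the 75% default mask ratio, and the dummy shapes are illustrative assumptions and are not taken from the linked repository.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """Randomly drop a fraction of patch tokens, as in MAE.

    patch_tokens: (batch, num_patches, embed_dim) patch embeddings.
    Returns the visible tokens, a binary mask (1 = masked), and the
    indices needed to restore the original patch order in the decoder.
    """
    b, n, d = patch_tokens.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n, device=patch_tokens.device)  # per-patch score
    ids_shuffle = torch.argsort(noise, dim=1)             # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)       # inverse permutation

    # Keep only the first n_keep patches of the shuffled order.
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).repeat(1, 1, d))

    # Binary mask over all patches in original order: 0 = visible, 1 = masked.
    mask = torch.ones(b, n, device=patch_tokens.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore

# Example usage with dummy patch embeddings (14x14 patches, ViT-Base dim):
tokens = torch.randn(2, 196, 768)
visible, mask, ids_restore = random_masking(tokens)
print(visible.shape)  # torch.Size([2, 49, 768]); only 25% of patches kept
```

In MAE, only the visible tokens pass through the encoder, which keeps pre-training cheap at high mask ratios; the decoder reinserts learnable mask tokens (unshuffled via ids_restore), and the reconstruction loss is averaged only over the masked patches, e.g. `(per_patch_mse * mask).sum() / mask.sum()`.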