Diffusion-based models for text-to-image generation have gained immense popularity due to recent advancements in efficiency, accessibility, and quality. Although it is becoming increasingly feasible to perform inference with these systems on consumer-grade GPUs, training them from scratch still requires access to large datasets and significant computational resources. In the case of medical image generation, the availability of large, publicly accessible datasets that include text reports is limited by legal and ethical concerns. While training a diffusion model on a private dataset may address this issue, it is not always feasible for institutions lacking the necessary computational resources. This work demonstrates that pre-trained Stable Diffusion models, originally trained on natural images, can be adapted to various medical imaging modalities by training text embeddings with textual inversion. In this study, we conducted experiments using medical datasets comprising only 100 samples from each of three medical modalities. Embeddings were trained in a matter of hours, while still retaining diagnostic relevance in image generation. The experiments were designed to achieve several objectives. First, we tuned the training and inference procedures of textual inversion, finding that larger embeddings and more training examples are needed for medical images. Second, we validated our approach by demonstrating a 2\% increase in diagnostic performance (AUC), from 0.78 to 0.80, for detecting prostate cancer on MRI, a challenging multi-modal imaging setting. Third, we performed simulations by interpolating between healthy and diseased states, combining multiple pathologies, and inpainting, demonstrating the flexibility of the embeddings and control over disease appearance. Finally, the embeddings trained in this study are small (less than 1 MB), which enables easy sharing of medical data with reduced privacy concerns.
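The interpolation between healthy and diseased states mentioned above can be sketched as a linear blend of the two learned textual-inversion token embeddings. This is a minimal illustration, not the authors' exact pipeline: the 768-dimensional embedding size (the CLIP text-embedding width used by Stable Diffusion v1) and the random embedding values are assumptions for demonstration.

```python
import numpy as np

def interpolate_embeddings(healthy: np.ndarray, diseased: np.ndarray,
                           alpha: float) -> np.ndarray:
    """Linearly interpolate between two learned token embeddings.

    alpha=0 returns the healthy embedding, alpha=1 the diseased one;
    intermediate values simulate a gradual transition in disease appearance.
    """
    return (1.0 - alpha) * healthy + alpha * diseased

# Hypothetical learned embeddings; in practice these would be the <healthy>
# and <diseased> token vectors produced by textual inversion training.
healthy = np.random.randn(768)
diseased = np.random.randn(768)

# A midpoint embedding that could be fed to the text encoder in place of
# either learned token to render an intermediate disease state.
midpoint = interpolate_embeddings(healthy, diseased, 0.5)
```

The interpolated vector is used exactly like a normal token embedding at inference time, so sweeping `alpha` from 0 to 1 yields a sequence of images transitioning between the two states.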