Multimodal models trained on large datasets of natural image-text pairs have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different from natural images, and the language used to succinctly capture relevant details in medical data draws on a different, narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multimodal models trained on natural image-text pairs tend not to generalize well to the medical domain. Developing generative imaging models that faithfully represent medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets. In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest X-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXR images conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and have human domain experts evaluate image quality and text-image alignment. We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled to a new degree by using free-form text prompts that include radiology-specific language. When we fine-tune this model on a fixed training set and use it as a data augmentation method, we measure a 5% improvement for a classifier trained jointly on synthetic and real images, and a 3% improvement when it is trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge into the text encoder and can improve its representation of certain diseases, such as pneumothorax, by 25%.