Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its ability to extract generic representations from medical images and texts. In practice, there are two typical types of Med-VLP model, \textit{i.e.}, the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. The former is superior at multi-modal tasks owing to the sufficient interaction between modalities; the latter is good at uni-modal and cross-modal tasks owing to its single-modality encoding ability. To take advantage of both, we propose a simple yet effective scheme named PTUnifier that unifies the two types. We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts. By doing so, a single model can serve as a \textit{foundation model} that handles tasks with different input formats (\textit{i.e.}, image-only, text-only, and image-text-pair). Furthermore, we construct a prompt pool (instead of static prompts) to improve diversity and scalability. Experimental results show that our approach achieves state-of-the-art results on a broad range of tasks, spanning uni-modal tasks (\textit{i.e.}, image/text classification and text summarization), cross-modal tasks (\textit{i.e.}, image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (\textit{i.e.}, visual question answering). Note that the adoption of prompts is orthogonal to most existing Med-VLP approaches and could be a beneficial and complementary extension to them.