The success of deep learning relies heavily on large datasets with extensive labels, but we often only have access to several small, heterogeneous datasets associated with partial labels, particularly in the field of medical imaging. When learning from multiple datasets, the challenges include incomparable, heterogeneous, or even conflicting labeling protocols across datasets. In this paper, we propose a new initiative, "data, assemble", which aims to unleash the full potential of partially labeled data and enormous unlabeled data from an assembly of datasets. To adapt the supervised learning paradigm to partial labels, we introduce a dynamic adapter that encodes multiple visual tasks and aggregates image features in a question-and-answer manner. Furthermore, we employ pseudo-labeling and consistency constraints to harness images with missing labels and to mitigate the domain gap across datasets. From proof-of-concept studies on three natural imaging datasets and rigorous evaluations on two large-scale thorax X-ray benchmarks, we discover that learning from "negative examples" facilitates both classification and segmentation of classes of interest. This sheds new light on the computer-aided diagnosis of rare diseases and emerging pandemics, wherein "positive examples" are hard to collect, yet "negative examples" are relatively easy to assemble. As a result, besides exceeding the prior art on the NIH ChestXray benchmark, our model is particularly strong in identifying diseases of minority classes, yielding an improvement of over 3 points on average. Remarkably, when using existing partial labels, our model performs on par (p>0.05) with one trained on a fully curated dataset with exhaustive labels, eliminating the need for an additional 40% in annotation costs.
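To make the idea of supervised learning under partial labels concrete, the sketch below shows one common way to handle classes that a given dataset never annotated: skip them when computing the loss. This is an illustrative assumption, not the dynamic adapter described above. The function name `masked_bce` and the convention of marking a missing label with `None` are hypothetical.

```python
import math


def masked_bce(probs, labels):
    """Mean binary cross-entropy over the labeled entries only.

    probs  : per-class predicted probabilities in (0, 1)
    labels : 1 (positive), 0 (negative), or None when the class was
             not annotated in the dataset the image came from
             (hypothetical convention for this sketch)
    """
    eps = 1e-7  # clamp probabilities to avoid log(0)
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y is None:
            continue  # missing label: contributes no supervised loss
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    # If nothing is labeled, there is no supervised signal at all.
    return total / count if count else 0.0


# Hypothetical example: three classes, the second one unlabeled.
loss = masked_bce([0.9, 0.2, 0.1], [1, None, 0])
```

In this sketch the prediction for the unlabeled class has no effect on the loss. Entries skipped this way are the ones that pseudo-labeling and consistency constraints would be meant to exploit.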