Data, Assemble: Leveraging Multiple Datasets with Heterogeneous and Partial Labels

The success of deep learning relies heavily on large datasets with extensive labels, but we often only have access to several small, heterogeneous datasets associated with partial labels, particularly in the field of medical imaging. When learning from multiple datasets, existing challenges include incomparable, heterogeneous, or even conflicting labeling protocols across datasets. In this paper, we propose a new initiative, "data, assemble," which aims to unleash the full potential of partially labeled data and enormous unlabeled data from an assembly of datasets. To adapt the supervised learning paradigm to partial labels, we introduce a dynamic adapter that encodes multiple visual tasks and aggregates image features in a question-and-answer manner. Furthermore, we employ pseudo-labeling and consistency constraints to harness images with missing labels and to mitigate the domain gap across datasets. From proof-of-concept studies on three natural imaging datasets and rigorous evaluations on two large-scale thorax X-ray benchmarks, we discover that learning from "negative examples" facilitates both classification and segmentation of classes of interest. This sheds new light on the computer-aided diagnosis of rare diseases and emerging pandemics, wherein "positive examples" are hard to collect, yet "negative examples" are relatively easy to assemble. As a result, besides exceeding the prior art on the NIH ChestXray benchmark, our model is particularly strong in identifying diseases of minority classes, yielding an improvement of over 3 points on average. Remarkably, when using existing partial labels, our model's performance is on par (p>0.05) with that obtained using a fully curated dataset with exhaustive labels, eliminating the need for an additional 40% of annotation costs.
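One common way to train on partial labels is to compute the loss only over the classes that were actually annotated for each image. The sketch below illustrates that general idea and is not the paper's dynamic adapter or its pseudo-labeling scheme; the function name `masked_bce` and the convention of marking a missing label with `None` are assumptions made here for illustration.

```python
import math


def masked_bce(probs, labels):
    """Mean binary cross-entropy over the labels that are present.

    probs  : list of per-image lists of predicted probabilities, one per class.
    labels : same shape; 1 = positive, 0 = negative, None = not annotated
             (the None convention is an assumption of this sketch).
    """
    eps = 1e-7  # keeps log() finite when a probability is exactly 0 or 1
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            if y is None:  # missing label: contributes nothing to the loss
                continue
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0


# Two images, three classes; each image is annotated for only some classes.
probs = [[0.9, 0.2, 0.5], [0.1, 0.7, 0.3]]
labels = [[1, None, None], [0, 1, None]]
loss = masked_bce(probs, labels)
```

Because unannotated entries are skipped, a negative label from one dataset still contributes a training signal for that class even when the same image carries no labels for the other classes.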

