Because annotation is expensive in human effort, curating a large-scale medical dataset that is fully labeled for all classes of interest is non-trivial. It is more practical to collect multiple small, partially labeled datasets from different matching sources, in which the medical images may have been annotated for only a subset of the classes of interest. This paper offers an empirical understanding of an under-explored problem, partially supervised multi-label classification (PSMLC), in which a multi-label classifier is trained on only partially labeled medical images. Compared with its fully supervised counterpart, the partial supervision caused by medical data scarcity has a non-trivial negative impact on model performance. A potential remedy is to augment the partial labels. Although vicinal risk minimization (VRM) is a promising way to improve a model's generalization ability, its application to PSMLC remains an open question. To bridge this methodological gap, we provide the first VRM-based solution to PSMLC. The empirical results also offer insights into future research directions for partially supervised learning under data scarcity.
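The abstract does not spell out how partial labels enter the training objective, so the following is only a minimal sketch of the general setting, not the paper's method. It assumes that each image carries a binary annotation mask marking which classes were labeled, that the loss is a binary cross-entropy restricted to annotated entries, and that the vicinal samples come from mixup, one common instance of VRM. The function names `masked_bce` and `mixup_partial` are hypothetical.

```python
import numpy as np

def masked_bce(logits, labels, mask):
    """Binary cross-entropy averaged over annotated (mask == 1) entries only."""
    # Numerically stable form: max(x, 0) - x*y + log(1 + exp(-|x|))
    per_entry = (np.maximum(logits, 0) - logits * labels
                 + np.log1p(np.exp(-np.abs(logits))))
    # Unannotated entries contribute nothing; guard against an all-zero mask.
    return float((per_entry * mask).sum() / max(mask.sum(), 1))

def mixup_partial(x, labels, mask, lam, perm):
    """Mixup-style vicinal samples for partially labeled data (assumed scheme).

    x:      (batch, features) inputs
    labels: (batch, classes) targets, with 0 stored where the label is unknown
    mask:   (batch, classes) 1 where the class was annotated, 0 otherwise
    lam:    mixing coefficient in [0, 1]
    perm:   permutation of batch indices giving each sample's mixing partner
    """
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * labels + (1 - lam) * labels[perm]
    # Assumption: a class stays supervised only if both sources annotated it.
    m_mix = mask * mask[perm]
    return x_mix, y_mix, m_mix
```

In a training loop under these assumptions, `lam` would typically be drawn from a Beta distribution and `perm` from a random permutation of the batch, and `masked_bce` would be applied to the model's logits on `x_mix` with `y_mix` and `m_mix`. Intersecting the masks is a conservative design choice: it never supervises a class with a fabricated label, at the cost of discarding classes that only one of the two mixed images has annotated.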