Large-scale multi-label classification datasets are commonly, and perhaps inevitably, partially annotated: only a small subset of the labels is annotated per sample. Different methods for handling the missing labels induce different properties in the trained model and affect its accuracy. In this work, we analyze the partial-labeling problem and propose a solution based on two key ideas. First, un-annotated labels should be treated selectively according to two probability quantities: the class distribution over the whole dataset and the likelihood of each specific label for a given sample. We propose to estimate the class distribution with a dedicated temporary model, and show that this estimate is more effective than a naive one computed from the dataset's partial annotations. Second, during training of the target model, we emphasize the contribution of annotated labels over originally un-annotated labels by using a dedicated asymmetric loss. With our novel approach, we achieve state-of-the-art results on the OpenImages dataset (e.g., reaching 87.3 mAP on V6). In addition, experiments conducted on LVIS and simulated-COCO demonstrate the effectiveness of our approach. Code is available at https://github.com/Alibaba-MIIL/PartialLabelingCSL.
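The two ideas above (selective handling of un-annotated labels based on an estimated class distribution, plus an asymmetric loss that favors annotated labels) can be illustrated with a minimal NumPy sketch. This is a hedged illustration, not the paper's exact formulation: the function name, the label encoding (+1 annotated positive, -1 annotated negative, 0 un-annotated), the `ignore_thresh` cutoff, and the `gamma_neg` focusing exponent are all illustrative assumptions.

```python
import numpy as np

def class_aware_selective_loss(logits, labels, class_prior,
                               ignore_thresh=0.5, gamma_neg=4.0):
    """Illustrative sketch of a class-aware selective loss for partial labels.

    labels: +1 = annotated positive, -1 = annotated negative, 0 = un-annotated.
    class_prior: per-class frequency estimate (e.g., from a temporary model).

    Un-annotated entries of frequent classes (prior above ignore_thresh) are
    likely missed positives, so they are ignored. The remaining un-annotated
    entries are treated as soft negatives, down-weighted by an asymmetric
    focusing term so that annotated labels dominate the gradient.
    """
    p = 1.0 / (1.0 + np.exp(-logits))          # sigmoid probabilities
    pos = (labels == 1).astype(float)
    neg = (labels == -1).astype(float)
    unann = (labels == 0).astype(float)

    # Selective rule: split un-annotated entries by the estimated class prior.
    ignore = unann * (class_prior > ignore_thresh)
    soft_neg = unann * (class_prior <= ignore_thresh)

    eps = 1e-8
    loss_pos = -pos * np.log(p + eps)
    loss_neg = -neg * np.log(1.0 - p + eps)
    # Asymmetric focusing: p ** gamma_neg shrinks the loss of easy negatives,
    # further reducing the influence of originally un-annotated labels.
    loss_soft = -soft_neg * (p ** gamma_neg) * np.log(1.0 - p + eps)

    total = loss_pos + loss_neg + loss_soft
    return np.where(ignore > 0, 0.0, total).mean()
```

Note the design choice this sketch encodes: un-annotated labels are not treated uniformly as negatives (the common baseline), but are either ignored or down-weighted depending on how plausible a missed annotation is.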