A fundamental limitation of applying semi-supervised learning in real-world settings is the assumption that unlabeled test data contains only classes previously encountered in the labeled training data. This assumption rarely holds for data in the wild, where instances belonging to novel classes may appear at test time. Here, we introduce an open-world semi-supervised learning setting that formalizes the notion that novel classes may appear in the unlabeled test data. In this setting, the goal is to resolve the class distribution mismatch between labeled and unlabeled data: at test time, every input instance must either be classified into one of the existing classes, or a new, previously unseen class must be initialized for it. To tackle this challenging problem, we propose ORCA, an end-to-end deep learning approach that introduces an uncertainty adaptive margin mechanism to circumvent the bias towards seen classes, which arises because discriminative features are learned faster for seen classes than for novel classes. In this way, ORCA reduces the gap between the intra-class variance of seen classes and that of novel classes. Experiments on image classification datasets and a single-cell annotation dataset demonstrate that ORCA consistently outperforms alternative baselines, achieving a 25% improvement on seen classes and a 96% improvement on novel classes of the ImageNet dataset.
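The abstract does not give the formula behind the uncertainty adaptive margin, so the following is only a minimal sketch of how such a mechanism could work, not ORCA's actual implementation. It assumes that uncertainty is estimated as one minus the average top softmax confidence on unlabeled data, and that a margin proportional to this uncertainty is subtracted from the true-class logit of labeled examples. The function names and the parameters `lam` and `scale` are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_uncertainty(unlabeled_logits):
    """Assumed uncertainty estimate: 1 minus the mean top softmax
    confidence over a batch of unlabeled examples."""
    confidences = [max(softmax(row)) for row in unlabeled_logits]
    return 1.0 - sum(confidences) / len(confidences)


def margin_cross_entropy(logits, label, uncertainty, lam=1.0, scale=10.0):
    """Cross-entropy for one labeled example with an adaptive margin.

    The margin (lam * uncertainty) is subtracted from the true-class
    logit, so the loss on seen classes is harder to minimize while the
    model is still uncertain on unlabeled data. `lam` and `scale` are
    illustrative hyperparameters, not values from the paper.
    """
    margin = lam * uncertainty
    adjusted = [
        scale * (z - margin) if k == label else scale * z
        for k, z in enumerate(logits)
    ]
    return -math.log(softmax(adjusted)[label])


if __name__ == "__main__":
    # Early in training: near-uniform predictions on unlabeled data.
    early = estimate_uncertainty([[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
    # Later in training: confident predictions on unlabeled data.
    late = estimate_uncertainty([[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]])
    example = [2.0, 1.0, 0.0]
    print("loss with high uncertainty:", margin_cross_entropy(example, 0, early))
    print("loss with low uncertainty:", margin_cross_entropy(example, 0, late))
```

Under these assumptions, the same labeled example incurs a larger loss when uncertainty on the unlabeled data is high, which is one way to slow down learning on seen classes until features for novel classes catch up.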