RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning

Semi-supervised learning (SSL) algorithms have had great success in recent years in limited labeled data regimes. However, current state-of-the-art SSL algorithms are computationally expensive, with significant compute time and energy requirements. This can be a serious limitation for many smaller companies and academic groups. Our main insight is that training on a subset of the unlabeled data, rather than on all of it, enables current SSL algorithms to converge faster, significantly reducing computational costs. In this work, we propose RETRIEVE, a coreset selection framework for efficient and robust semi-supervised learning. RETRIEVE selects the coreset by solving a mixed discrete-continuous bi-level optimization problem such that the selected coreset minimizes the labeled set loss. We use a one-step gradient approximation and show that the discrete optimization problem is approximately submodular, enabling simple greedy algorithms to obtain the coreset. We empirically demonstrate on several real-world datasets that existing SSL algorithms such as VAT, Mean-Teacher, and FixMatch, when used with RETRIEVE, achieve (a) faster training times and (b) better performance when the unlabeled data contains Out-of-Distribution (OOD) data or is imbalanced. More specifically, we show that with minimal accuracy degradation, RETRIEVE achieves a speedup of around $3\times$ in the traditional SSL setting and a speedup of $5\times$ compared to state-of-the-art (SOTA) robust SSL algorithms in the case of imbalance and OOD data. RETRIEVE is available as part of the CORDS toolkit: https://github.com/decile-team/cords.
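To make the selection idea concrete, the following is a minimal sketch of greedy coreset selection under a one-step gradient update. It is an illustration under stated assumptions, not the RETRIEVE implementation or the CORDS API. The function name `greedy_coreset`, its parameters, and the toy quadratic labeled loss are all hypothetical. The sketch also assumes that per-sample gradients of the unlabeled loss are already available as an array, and it evaluates the labeled loss exactly after each trial step instead of using the approximations the abstract refers to.

```python
import numpy as np


def greedy_coreset(theta, unlabeled_grads, labeled_loss, budget, lr=0.1):
    """Greedily pick unlabeled samples whose one-step update lowers the labeled loss.

    Hypothetical helper for illustration only (not part of CORDS).
    theta           : current parameter vector, shape (d,)
    unlabeled_grads : per-sample unlabeled-loss gradients, shape (n, d)
    labeled_loss    : callable mapping a parameter vector to a scalar loss
    budget          : number of samples to select
    lr              : step size of the one-step gradient update
    """
    unlabeled_grads = np.asarray(unlabeled_grads, dtype=float)
    budget = min(budget, len(unlabeled_grads))
    selected = []
    grad_sum = np.zeros_like(theta, dtype=float)  # summed gradient of the current coreset
    remaining = set(range(len(unlabeled_grads)))

    for _ in range(budget):
        best_idx, best_loss = None, np.inf
        for i in sorted(remaining):
            # Parameters after one gradient step on the coreset plus candidate i.
            trial = theta - lr * (grad_sum + unlabeled_grads[i])
            loss = labeled_loss(trial)
            if loss < best_loss:
                best_idx, best_loss = i, loss
        if best_idx is None:  # every candidate gave a non-finite loss
            break
        selected.append(best_idx)
        grad_sum += unlabeled_grads[best_idx]
        remaining.remove(best_idx)
    return selected


# Toy usage: the labeled loss is the squared distance to a target parameter vector.
target = np.array([1.0, 0.0])
toy_grads = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-0.5, 0.0]])
coreset = greedy_coreset(np.zeros(2), toy_grads, lambda t: float(np.sum((t - target) ** 2)), budget=2)
print(coreset)
```

In this toy setting the greedy loop favors samples whose gradients move the parameters toward the labeled-loss minimizer and skips those that move away from it. That is the intuition behind selecting a coreset that minimizes the labeled set loss.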
