Semi-supervised learning (SSL) algorithms have had great success in recent years in limited labeled data regimes. However, current state-of-the-art SSL algorithms are computationally expensive, with significant compute time and energy requirements. This can be a serious limitation for many smaller companies and academic groups. Our main insight is that training on a subset of the unlabeled data, rather than on all of it, enables current SSL algorithms to converge faster, significantly reducing computational costs. In this work, we propose RETRIEVE, a coreset selection framework for efficient and robust semi-supervised learning. RETRIEVE selects the coreset by solving a mixed discrete-continuous bi-level optimization problem such that the selected coreset minimizes the labeled set loss. We use a one-step gradient approximation and show that the resulting discrete optimization problem is approximately submodular, enabling simple greedy algorithms to obtain the coreset. We empirically demonstrate on several real-world datasets that existing SSL algorithms such as VAT, Mean-Teacher, and FixMatch, when used with RETRIEVE, achieve (a) faster training times and (b) better performance when the unlabeled data includes Out-of-Distribution (OOD) data and imbalance. More specifically, we show that with minimal accuracy degradation, RETRIEVE achieves a speedup of around $3\times$ in the traditional SSL setting and a speedup of $5\times$ compared to state-of-the-art (SOTA) robust SSL algorithms in the case of imbalance and OOD data. RETRIEVE is available as part of the CORDS toolkit: https://github.com/decile-team/cords.
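The selection idea described above (pick unlabeled points so that one gradient step on them lowers the labeled set loss, and build the subset greedily) can be illustrated with a minimal sketch. This is not the RETRIEVE implementation from CORDS. It assumes a toy linear model with squared loss and precomputed per-example gradients of the unlabeled loss, and the names `labeled_loss` and `greedy_coreset` are hypothetical.

```python
import numpy as np


def labeled_loss(theta, X_l, y_l):
    """Half mean squared error of a linear model on the labeled set (toy assumption)."""
    residual = X_l @ theta - y_l
    return 0.5 * float(np.mean(residual ** 2))


def greedy_coreset(theta, unlabeled_grads, X_l, y_l, budget, lr=0.1):
    """Greedily select up to `budget` unlabeled indices (hypothetical helper).

    unlabeled_grads[j] is assumed to hold the gradient of the unlabeled loss
    of example j at the current parameters `theta`. At every step we add the
    index whose inclusion gives the lowest labeled loss after ONE gradient
    step: theta - lr * (sum of gradients of the selected examples).
    """
    n = unlabeled_grads.shape[0]
    budget = min(budget, n)
    selected = []
    grad_sum = np.zeros_like(theta, dtype=float)
    for _ in range(budget):
        best_idx, best_loss = None, np.inf
        for j in range(n):
            if j in selected:
                continue
            # One-step look-ahead with candidate j added to the subset.
            candidate_theta = theta - lr * (grad_sum + unlabeled_grads[j])
            loss = labeled_loss(candidate_theta, X_l, y_l)
            if loss < best_loss:
                best_idx, best_loss = j, loss
        selected.append(best_idx)
        grad_sum += unlabeled_grads[best_idx]
    return selected
```

In this sketch each greedy step re-evaluates the labeled loss for every remaining candidate, which is only practical for small toy problems. The abstract's approximate submodularity claim is what justifies using a simple greedy procedure of this kind for the actual discrete problem.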