Labeling data is one of the most costly processes in machine learning pipelines. Active learning is a standard approach to alleviating this problem. Pool-based active learning first builds a pool of unlabeled data and then iteratively selects data to be labeled so that the total number of required labels is minimized while keeping model performance high. Many effective criteria for choosing data from the pool have been proposed in the literature. However, how to build the pool itself is less explored. Specifically, most methods assume that a task-specific pool is given for free. In this paper, we argue that such a task-specific pool is not always available and propose using the myriad of unlabeled data on the Web as the pool to which active learning is applied. Because this pool is extremely large, relevant data are likely to exist in it for many tasks, and we do not need to explicitly design and build a pool for each task. The challenge is that we cannot exhaustively compute the acquisition scores of all data due to the size of the pool. We propose an efficient method, Seafaring, that retrieves data informative for active learning from the Web using a user-side information retrieval algorithm. In our experiments, we use the online Flickr environment as the pool for active learning. This pool contains more than ten billion images and is several orders of magnitude larger than the pools used in the existing active learning literature. We confirm that our method outperforms existing approaches that use a small unlabeled pool.
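To make the pool-based loop described above concrete, here is a minimal sketch of pool-based active learning with uncertainty (entropy) sampling as the acquisition score. The toy nearest-centroid classifier, the synthetic two-cluster pool, and all function names are illustrative assumptions for this sketch, not the paper's Seafaring method, which operates on a Web-scale pool where exhaustive scoring is infeasible.

```python
# Minimal sketch of pool-based active learning with entropy-based
# uncertainty sampling. The classifier and data are toy assumptions.
import math
import random

def fit_centroids(X, y):
    """Train a toy classifier: one centroid per class."""
    sums = {}
    for xi, yi in zip(X, y):
        sx, sy, n = sums.get(yi, (0.0, 0.0, 0))
        sums[yi] = (sx + xi[0], sy + xi[1], n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

def predict_proba(centroids, x):
    """Softmax over negative distances to each class centroid."""
    scores = {c: -math.dist(x, mu) for c, mu in centroids.items()}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def entropy(probs):
    """Acquisition score: higher entropy = more informative point."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

random.seed(0)
# Two synthetic clusters; the "oracle" labels a point by its cluster.
pool = [(random.gauss(-2, 1), random.gauss(0, 1)) for _ in range(50)] \
     + [(random.gauss(2, 1), random.gauss(0, 1)) for _ in range(50)]
oracle = lambda x: 0 if x[0] < 0 else 1

# Seed the labeled set with one point per class; the rest is the pool.
labeled = [pool[0], pool[50]]
labels = [oracle(pool[0]), oracle(pool[50])]
unlabeled = pool[1:50] + pool[51:]

for _ in range(10):  # query budget of 10 labels
    model = fit_centroids(labeled, labels)
    # Score every pool point and query the least certain one.
    i = max(range(len(unlabeled)),
            key=lambda j: entropy(predict_proba(model, unlabeled[j])))
    x = unlabeled.pop(i)
    labeled.append(x)
    labels.append(oracle(x))  # ask the oracle (annotator) for its label

print(len(labeled), len(unlabeled))  # 12 labeled, 88 left in the pool
```

Note the `max` over the whole pool in the loop body: this exhaustive scan is exactly what stops scaling when the pool holds billions of items, which is the bottleneck the paper's user-side retrieval approach is designed to avoid.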