Machine learning (ML) is revolutionizing the world, affecting almost every field of science and industry. Recent algorithms (in particular, deep networks) are increasingly data-hungry, requiring large datasets for training. Thus, the dominant paradigm in ML today involves constructing large, task-specific datasets. However, obtaining high-quality datasets of such magnitude is difficult. A variety of methods have been proposed to address this data bottleneck, but they are scattered across different research areas, making it hard for a practitioner to keep up with the latest developments. In this work, we propose a taxonomy of these methods. Our goal is twofold: (1) we wish to raise the community's awareness of the methods that already exist and encourage more efficient use of resources, and (2) we hope that such a taxonomy will contribute to our understanding of the problem, inspiring novel ideas and strategies to replace current annotation-heavy approaches.