We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate it using two methods: projecting the historical growth rate, and estimating the compute-optimal dataset size for predicted future compute budgets. We investigate whether this growth in data usage can be sustained by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon, likely before 2026. By contrast, the stocks of low-quality language data and image data will be exhausted only much later: between 2030 and 2050 for low-quality language data, and between 2030 and 2060 for images. Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down unless data efficiency improves drastically or new sources of data become available.