Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical imaging, where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget. Prior work on neural scaling laws suggests that a power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the data set size required to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements. Finally, we show that incorporating a tuned correction factor and collecting data over multiple rounds significantly improves the performance of the data estimators. Using our guidelines, practitioners can accurately estimate the data requirements of machine learning systems and thereby save both development time and data acquisition costs.
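To make the power-law extrapolation idea concrete, the sketch below fits an error curve of the form error(n) = a · n^(−b) to a few observed (data set size, validation error) pairs via least squares in log-log space, then inverts the fit to estimate the size needed for a target error. This is a minimal illustration of the general approach, not the paper's actual estimator; the function names `fit_power_law` and `required_size` and the synthetic data points are illustrative assumptions.

```python
import math

def fit_power_law(sizes, errors):
    """Fit error(n) = a * n**(-b) by ordinary least squares on
    log(error) = log(a) - b * log(n). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)  # intercept in log space -> scale a
    b = -slope                     # decay exponent
    return a, b

def required_size(a, b, target_error):
    """Invert error(n) = a * n**(-b) to estimate the data set size
    at which the fitted curve reaches target_error."""
    return (a / target_error) ** (1.0 / b)

if __name__ == "__main__":
    # Hypothetical measurements: validation error at three pilot sizes.
    sizes = [100, 400, 1600]
    errors = [0.2, 0.1, 0.05]
    a, b = fit_power_law(sizes, errors)
    n_needed = required_size(a, b, target_error=0.025)
    print(f"a={a:.3f}, b={b:.3f}, estimated size needed: {n_needed:.0f}")
```

As the abstract notes, naive extrapolation of this kind tends to misestimate requirements; the paper's multi-round scheme would refit the curve after each collection round and apply a tuned correction factor to the resulting estimate.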