Data is the fuel powering AI and creates tremendous value for many domains. However, collecting datasets for AI is a time-consuming, expensive, and complicated endeavor, and data investment remains a leap of faith for practitioners. In this work, we study the data budgeting problem and formulate it as two sub-problems: predicting (1) the saturating performance achievable given enough data, and (2) how many data points are needed to reach near that saturating performance. Unlike traditional dataset-independent methods such as PowerLaw, we propose a learning-based method to solve the data budgeting problem. To support and systematically evaluate this method, we curate a large collection of 383 tabular ML datasets, along with their data-versus-performance curves. Our empirical evaluation shows that it is possible to perform data budgeting given a small pilot study dataset with as few as $50$ data points.
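To make the two sub-problems concrete, the following is a minimal, hypothetical sketch of the dataset-independent PowerLaw baseline mentioned above: fit a curve of the form $\mathrm{acc}(n) = a - b\,n^{-c}$ to a pilot learning curve, then read off (1) the saturating performance $a$ and (2) the smallest data size whose predicted performance falls within a chosen gap of $a$. The function names and the grid-search fitting strategy are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical PowerLaw baseline sketch: fit acc(n) = a - b * n^(-c)
# to (data size, accuracy) pairs from a small pilot study.

def fit_power_law(sizes, accs, c_grid=None):
    """Grid-search over the exponent c; for each candidate c, solve the
    remaining linear model acc = a - b * n^(-c) by closed-form least
    squares for (a, b), and keep the fit with the smallest squared error."""
    if c_grid is None:
        c_grid = [i / 100 for i in range(5, 151)]  # c in [0.05, 1.50]
    m = len(sizes)
    best = None
    for c in c_grid:
        xs = [n ** (-c) for n in sizes]            # regressor: n^(-c)
        mean_x = sum(xs) / m
        mean_y = sum(accs) / m
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs))
        slope = sxy / sxx                          # slope equals -b
        a = mean_y - slope * mean_x                # intercept equals a
        b = -slope
        sse = sum((a - b * x - y) ** 2 for x, y in zip(xs, accs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c

def points_needed(a, b, c, gap=0.01):
    """Smallest n with acc(n) >= a - gap, i.e. b * n^(-c) <= gap."""
    return (b / gap) ** (1 / c)
```

For example, fitting pilot accuracies measured at sizes 50, 100, 200, and 400 yields an estimate of the performance ceiling `a` and, via `points_needed`, a budget estimate for reaching within `gap` of that ceiling. The grid-search-plus-least-squares scheme is just one simple way to fit this family; any nonlinear least-squares routine would serve equally well.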