Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.
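To fix ideas, here is a minimal sketch of what one single-round instance of such an optimal data collection problem could look like; the notation ($n_0$ points already held, $q$ points to collect, per-sample cost $c$, penalty $P$, score curve $V(\cdot)$, target $V^*$) is illustrative rather than necessarily the paper's own:

$$
\min_{q \ge 0} \; c\,q \;+\; P \cdot \mathbb{E}\!\left[\mathbf{1}\{\,V(n_0 + q) < V^{*}\,\}\right]
$$

Since the score curve $V$ is unknown at decision time, the expectation is taken over a learned model of it; with multiple data sources (e.g., labeled and unlabeled pools for semi-supervised learning), $q$ and $c$ become vectors of per-source quantities and costs.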
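As a concrete point of comparison, the scaling-law baseline can be sketched in a few lines: fit a saturating power law to the (dataset size, score) pairs observed so far, then invert it to estimate the size needed to hit the target. Everything below (the functional form, the numbers, and names such as `scaling_law`) is an illustrative assumption, not the paper's exact regression procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical statistics gathered so far: dataset sizes and the
# validation scores of models trained on them (illustrative numbers).
n_obs = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
v_obs = np.array([0.62, 0.68, 0.74, 0.79, 0.83])

def scaling_law(n, v_inf, a, b):
    """Saturating power law: the score approaches v_inf as n grows."""
    return v_inf - a * n ** (-b)

# Fit the three parameters to the observed (size, score) pairs.
params, _ = curve_fit(
    scaling_law, n_obs, v_obs,
    p0=[0.95, 5.0, 0.3],  # rough initial guess
    bounds=([0.0, 0.0, 0.0], [1.0, np.inf, 2.0]),
)
v_inf, a, b = params

# Invert the fitted curve to estimate the dataset size needed to reach
# a target score V* (only meaningful when the target is below v_inf).
v_target = 0.90
if v_target < v_inf:
    n_required = (a / (v_inf - v_target)) ** (1.0 / b)
    print(f"estimated requirement: ~{n_required:,.0f} samples")
else:
    print("target exceeds the fitted ceiling; the extrapolation fails")
```

A single extrapolation of this kind ignores fitting error: underestimating the requirement forces additional collection rounds, which is precisely the risk that the paper's formulation penalizes and that LOC trades off against total collection cost.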