When building datasets, one must invest time, money, and energy either to aggregate more data or to improve its quality. The most common practice favors quantity over quality, without necessarily quantifying the trade-off that emerges. In this work, we study data-driven contextual decision-making and the performance implications of the quality and quantity of data. We focus on contextual decision-making with a Newsvendor loss. This loss is that of a central capacity planning problem in Operations Research, and is also the loss associated with quantile regression. We consider a model in which outcomes observed in similar contexts have similar distributions, and analyze the performance of a classical class of kernel policies, which weight data according to their similarity in a contextual space. We develop a series of results that lead to an exact characterization of the worst-case expected regret of these policies. This exact characterization applies to any sample size and any observed contexts. The model we develop is flexible and captures the case of partially observed contexts. This exact analysis enables us to unveil new structural insights into the learning behavior of uniform kernel methods: i) the specialized analysis leads to very large improvements in the quantification of performance compared to state-of-the-art general-purpose bounds; ii) we show an important non-monotonicity of performance as a function of data size that is not captured by previous bounds; and iii) we show that in some regimes, a small increase in the quality of the data can dramatically reduce the number of samples required to reach a performance target. All in all, our work demonstrates that it is possible to quantify precisely the interplay of data quality, data quantity, and performance in a central problem class. It also highlights the need for problem-specific bounds in order to understand the trade-offs at play.
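To make the abstract's policy class concrete, the following is a minimal sketch of a uniform kernel Newsvendor policy: it orders the critical-ratio quantile of demands observed in contexts close to the current one. The function name, the one-dimensional context, and the fallback rule are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def uniform_kernel_newsvendor(contexts, demands, x, bandwidth, critical_ratio):
    """Illustrative uniform kernel policy (hypothetical helper, not the
    paper's exact formulation): order the `critical_ratio` quantile of
    demands observed in contexts within `bandwidth` of the query context x.
    The critical ratio is the Newsvendor fractile, e.g. b / (b + h) for
    underage cost b and overage cost h."""
    contexts = np.asarray(contexts, dtype=float)
    demands = np.asarray(demands, dtype=float)
    # Uniform kernel: weight 1 for contexts within the bandwidth, 0 otherwise.
    mask = np.abs(contexts - x) <= bandwidth
    if not mask.any():
        # No similar data observed: fall back to the global empirical quantile
        # (an assumption for this sketch; other fallbacks are possible).
        return float(np.quantile(demands, critical_ratio))
    return float(np.quantile(demands[mask], critical_ratio))
```

For example, with demands `[1, 2, 3]` observed at context 0 and a critical ratio of 0.5, querying at context 0 with a small bandwidth returns the local median 2.0; shrinking or widening the bandwidth changes which samples are pooled, which is exactly the quality-versus-quantity trade-off the abstract analyzes.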