The design of data-driven formulations for machine learning and decision-making with good out-of-sample performance is a key challenge. The observation that good in-sample performance does not guarantee good out-of-sample performance is generally known as overfitting. In practice, overfitting typically cannot be attributed to a single cause but instead arises from several factors at once. We consider three sources of overfitting here: (i) statistical error, which results from working with finite sample data, (ii) data noise, which occurs when the data points are measured only with finite precision, and (iii) data misspecification, in which a small fraction of the data may be wholly corrupted. We argue that although existing data-driven formulations may be robust against any one of these three sources in isolation, they do not provide holistic protection against all three simultaneously. We design a novel data-driven formulation that does guarantee such holistic protection and is, furthermore, computationally viable. Our distributionally robust optimization formulation can be interpreted as a novel combination of a Kullback-Leibler and a Levy-Prokhorov robust optimization formulation. Finally, we show how, in the context of classification and regression problems, several popular regularized and robust formulations reduce to particular cases of our proposed, more general formulation.