It is regarded as axiomatic that a good model is one that strikes a compromise between bias and variance. Bias is measured by the training cost, while the variance of a (say, regression) model is measured by the cost on a validation set. If reducing bias is the goal, one will strive to fit as complex a model as necessary, but complexity is invariably coupled with variance: greater complexity implies greater variance. In practice, driving the training cost to near zero poses no fundamental problem; indeed, a sufficiently deep decision tree can drive the training cost exactly to zero. The difficulty lies, rather, in controlling the model's variance. We investigate several regression model frameworks, including generalized linear models, Cox proportional hazards models, and ARMA models, and illustrate how misspecifying a model affects the variance.
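The trade-off described above can be sketched numerically. The following is a minimal illustration, not taken from the paper: it fits polynomials of low and high degree to noisy quadratic data (a hypothetical setup) and compares training cost against validation cost. Because the low-degree model class is nested inside the high-degree one, the training cost can only fall as complexity grows, while the validation cost typically rises once the model starts fitting noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy quadratic signal on [-1, 1].
x = rng.uniform(-1.0, 1.0, 40)
y = x**2 + rng.normal(0.0, 0.1, 40)

# Split into a training half and a validation half.
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def costs(degree):
    """Fit a least-squares polynomial of the given degree on the
    training half; return (training MSE, validation MSE)."""
    coef = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
    return train_mse, val_mse

simple_train, simple_val = costs(2)    # low complexity
complex_train, complex_val = costs(12) # high complexity

# The training cost (the bias proxy) is non-increasing in complexity,
# because degree-2 polynomials are a subset of degree-12 polynomials.
# The validation cost (the variance proxy) typically moves the other way.
print(simple_train, simple_val)
print(complex_train, complex_val)
```

The nesting argument guarantees the training-cost ordering; the validation-cost behaviour depends on the noise realization, which is exactly the variance the abstract is concerned with.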