Cross-validation techniques for risk estimation and model selection are widely used in statistics and machine learning. However, the theoretical properties of learning via model selection with cross-validation risk estimation remain poorly understood, despite its widespread use. In this context, this paper presents learning via model selection with cross-validation risk estimation as a general systematic learning framework within classical statistical learning theory and establishes distribution-free deviation bounds in terms of the VC dimension, giving detailed proofs of the results and considering both bounded and unbounded loss functions. In particular, we investigate how the generalization of learning via model selection may be improved by modeling the collection of candidate models. We define Learning Spaces as a class of candidate models in which the partial order by inclusion reflects the models' complexities, and we formalize a way of defining them based on domain knowledge. We illustrate this modeling in a worst-case scenario of learning a classifier with finite domain and in a typical scenario of linear regression. Through theoretical insights and concrete examples, we aim to provide guidance on selecting the family of candidate models based on domain knowledge to improve generalization.