We study a class of prediction problems in which relatively few observations have associated responses, but all observations include both standard covariates and additional "helper" covariates. While the end goal is to make high-quality predictions using only the standard covariates, helper covariates can be exploited during training to improve prediction. Helper covariates arise in many applications, including forecasting in time series; incorporation of biased or mis-calibrated predictions from foundation models; and sharing information in transfer learning. We propose "prediction aided by surrogate training" ($\texttt{PAST}$), a class of methods that first exploit labeled data to construct a response estimator based on both the standard and helper covariates, and then use the full dataset with pseudo-responses to train a predictor based only on standard covariates. We establish guarantees on the prediction error of this procedure, with the response estimator allowed to be constructed in an arbitrary way, and the final predictor fit by empirical risk minimization over an arbitrary function class. These upper bounds involve the risk associated with the oracle dataset (all responses available), plus an overhead that measures the accuracy of the pseudo-responses. This theory identifies regimes in which the accuracy of $\texttt{PAST}$ is comparable to that of the oracle, as well as more challenging regimes in which it performs poorly. We demonstrate its empirical performance across a range of applications, including forecasting of societal ills over time with future covariates as helpers; prediction of cardiovascular risk after heart attacks with prescription data as helpers; and diagnosis of pneumonia from chest X-rays using machine-generated predictions as helpers.
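The two-stage procedure described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: both stages use ordinary least squares (the abstract allows an arbitrary response estimator and an arbitrary function class), the data are synthetic, and the names `fit_least_squares`, `predict`, and `past_fit` are hypothetical.

```python
import numpy as np

def fit_least_squares(features, targets):
    """Fit a linear model with intercept by ordinary least squares."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef

def predict(coef, features):
    """Evaluate a fitted linear model (intercept first) on new features."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef

def past_fit(x_all, w_all, y_labeled, labeled_idx):
    """Hypothetical PAST-style fit with linear models in both stages."""
    # Stage 1: response estimator built from labeled data only,
    # using both standard covariates x and helper covariates w.
    xw_all = np.column_stack([x_all, w_all])
    stage1 = fit_least_squares(xw_all[labeled_idx], y_labeled)
    # Pseudo-responses for every observation, labeled or not.
    pseudo = predict(stage1, xw_all)
    # Stage 2: empirical risk minimization (squared loss, linear class)
    # on the full dataset, using the standard covariates only.
    return fit_least_squares(x_all, pseudo)

# Synthetic data (assumed setup): the helper w is a noisy copy of the response.
rng = np.random.default_rng(0)
n_total, n_labeled = 2000, 50
true_beta = np.array([1.0, -2.0, 0.5])
x = rng.normal(size=(n_total, 3))
y = x @ true_beta + rng.normal(size=n_total)
w = y + 0.1 * rng.normal(size=n_total)

labeled_idx = np.arange(n_labeled)  # only these responses are treated as observed
coef_past = past_fit(x, w, y[labeled_idx], labeled_idx)
coef_labeled_only = fit_least_squares(x[labeled_idx], y[labeled_idx])

print("PAST slope error:        ", np.abs(coef_past[1:] - true_beta).max())
print("labeled-only slope error:", np.abs(coef_labeled_only[1:] - true_beta).max())
```

In this toy setting the helper is highly informative, so the pseudo-responses are close to the true responses and the second stage can benefit from all 2000 observations. With a weak or misleading helper the pseudo-responses would be less accurate, which corresponds to the overhead term in the bounds.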