Consider a prediction setting where a few inputs (e.g., satellite images) are expensively annotated with the prediction targets (e.g., crop types), and many inputs are cheaply annotated with auxiliary information (e.g., climate information). How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt out-of-distribution (OOD) error; while (ii) using auxiliary information as outputs of auxiliary tasks to pre-train a model improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error.
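The four-step In-N-Out pipeline described above can be sketched in the paper's own multi-task linear regression setting. Everything below (the synthetic data generator, the dimensions, the gradient-descent fine-tuning schedule) is an illustrative assumption, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic setup: inputs x, auxiliary information aux linearly
# related to x, and targets y depending on both x and aux.
n_lab, n_unlab, n_ood, d, d_aux = 50, 500, 500, 10, 3
W_true = rng.normal(size=(d, d_aux))   # true x -> aux map
w_y = rng.normal(size=d + d_aux)       # true (x, aux) -> y map

def make(n):
    x = rng.normal(size=(n, d))
    aux = x @ W_true + 0.1 * rng.normal(size=(n, d_aux))
    y = np.concatenate([x, aux], axis=1) @ w_y
    return x, aux, y

x_lab, aux_lab, y_lab = make(n_lab)          # expensively labeled, in-distribution
x_unlab, aux_unlab, y_unlab = make(n_unlab)  # y_unlab held out (treated as unknown)
x_ood, aux_ood, _ = make(n_ood)              # OOD inputs with auxiliary info only

def lstsq_fit(X, Y):
    # Ordinary least squares: weights minimizing ||X w - Y||^2.
    return np.linalg.lstsq(X, Y, rcond=None)[0]

# Step 1: train an aux-inputs model on the labeled data: y from (x, aux).
w_in = lstsq_fit(np.concatenate([x_lab, aux_lab], axis=1), y_lab)

# Step 2: pseudolabel all in-distribution inputs with that model.
pseudo_y = np.concatenate([x_unlab, aux_unlab], axis=1) @ w_in

# Step 3: pre-train on auxiliary *outputs*: learn x -> aux on OOD data,
# giving a pre-trained representation h(x) = x @ W.
W = lstsq_fit(x_ood, aux_ood)

# Step 4: fine-tune all weights on the pseudolabels (self-training) by
# gradient descent, starting from the pre-trained representation W and a
# warm-started prediction head v.
v = lstsq_fit(x_unlab @ W, pseudo_y)

def train_mse(W, v):
    return np.mean(((x_unlab @ W) @ v - pseudo_y) ** 2)

init_mse = train_mse(W, v)
lr = 0.01
for _ in range(3000):
    err = (x_unlab @ W) @ v - pseudo_y           # residuals, shape (n_unlab,)
    grad_v = (x_unlab @ W).T @ err / n_unlab
    grad_W = x_unlab.T @ np.outer(err, v) / n_unlab
    v -= lr * grad_v
    W -= lr * grad_W
final_mse = train_mse(W, v)

def predict(x):
    # Final In-N-Out predictor: pre-trained-then-fine-tuned representation
    # composed with the fine-tuned head.
    return (x @ W) @ v
```

Because the sketch is fully linear and the targets here are an exact linear function of (x, aux), the Step 1 model recovers the true weights and the pseudolabels are near-exact; the fine-tuning loop then fits the pre-trained representation and head jointly, which is where the aux-output initialization enters.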