Deploying machine learning models to new tasks is a major challenge despite the large size of modern training datasets. However, it is conceivable that the training data can be reweighted to be more representative of the new (target) task. We consider the problem of reweighting the training samples to gain insights into the distribution of the target task. Specifically, we formulate a distribution shift model based on the exponential tilt assumption and learn importance weights for the train data by minimizing the KL divergence between the labeled train and unlabeled target datasets. The learned train data weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection. We demonstrate the efficacy of our method on the Waterbirds and Breeds benchmarks.
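A minimal sketch of the exponential-tilt reweighting idea described above, under simplifying assumptions: the tilt is taken over generic sample features f(x) (e.g., from a pretrained encoder) rather than the paper's exact joint formulation over (x, y), and all names (`fit_tilt_weights`, `feats_train`, `feats_target`) are illustrative. Minimizing the KL divergence from the target distribution to the tilted train distribution reduces to maximizing E_target[θ·f(x)] - log E_train[exp(θ·f(x))], after which exp(θ·f(x)) gives (unnormalized) importance weights for the train samples.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def neg_tilted_loglik(theta, feats_train, feats_target):
    # log-normalizer log E_train[exp(theta · f(x))], estimated on the train sample
    scores_train = feats_train @ theta
    log_z = logsumexp(scores_train) - np.log(len(scores_train))
    # negative of the tilted log-likelihood of the (unlabeled) target sample
    return -(np.mean(feats_target @ theta) - log_z)

def fit_tilt_weights(feats_train, feats_target):
    """Fit the tilt parameter theta and return importance weights for train samples."""
    d = feats_train.shape[1]
    res = minimize(neg_tilted_loglik, np.zeros(d),
                   args=(feats_train, feats_target), method="L-BFGS-B")
    theta = res.x
    w = np.exp(feats_train @ theta)
    return w / w.mean()  # normalized so the average weight is 1

# Synthetic example: target features are mean-shifted relative to train.
rng = np.random.default_rng(0)
feats_train = rng.normal(0.0, 1.0, size=(2000, 5))
feats_target = rng.normal(0.5, 1.0, size=(1000, 5))
weights = fit_tilt_weights(feats_train, feats_target)
```

The resulting `weights` could then be plugged into weighted evaluation or weighted fine-tuning on the train set, in the spirit of the downstream uses mentioned in the abstract.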