When training and evaluating machine learning models on a large number of tasks, it is important to look not only at average task accuracy -- which may be biased by easy or redundant tasks -- but also at worst-case accuracy (i.e., the performance on the task with the lowest accuracy). In this work, we show how to use techniques from the distributionally robust optimization (DRO) literature to improve worst-case performance in multitask learning. We highlight several failure cases of DRO when applied off the shelf and present an improved method, Lookahead-DRO (L-DRO), which mitigates these issues. The core idea of L-DRO is to anticipate the interaction between tasks during training in order to choose a dynamic re-weighting of the various task losses that (i) leads to minimal worst-case loss and (ii) trains on as many tasks as possible. After demonstrating the efficacy of L-DRO in a small controlled synthetic setting, we evaluate it on two realistic benchmarks: a multitask version of the CIFAR-100 image classification dataset and a large-scale multilingual language modeling experiment. Our empirical results show that L-DRO achieves a better trade-off between average and worst-case accuracy than several strong baselines, with little computational overhead.
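To make the re-weighting idea concrete, here is a minimal sketch of the generic DRO-style update that the abstract builds on: an exponentiated-gradient step that up-weights tasks with higher loss. This is a standard baseline from the DRO literature, not the paper's L-DRO algorithm (which additionally anticipates task interactions during training); the function name and step size `eta` are illustrative assumptions.

```python
import numpy as np

def dro_weight_update(weights, losses, eta=0.1):
    """One exponentiated-gradient ascent step on the task-weight simplex.

    Tasks with higher loss receive exponentially larger weight, so the
    weighted training objective approaches the worst-case task loss.
    This is a generic DRO-style re-weighting sketch, not L-DRO itself.
    """
    w = weights * np.exp(eta * losses)  # up-weight high-loss tasks
    return w / w.sum()                  # project back onto the simplex

# Hypothetical example with three tasks: the hardest task gains weight.
weights = np.ones(3) / 3
losses = np.array([0.2, 0.5, 1.3])
weights = dro_weight_update(weights, losses)
```

Iterating this update between training steps yields a dynamic weighting that tracks whichever task is currently worst, which is exactly the behavior whose failure modes (e.g., collapsing all weight onto one task) the paper sets out to mitigate.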