As general-purpose vision models become increasingly effective across a wide range of tasks, it is imperative that they remain consistent across the tasks they support. Inconsistent AI models are considered brittle and untrustworthy by human users and are more challenging to incorporate into larger systems that depend on their outputs. Measuring consistency between heterogeneous tasks whose outputs may span different modalities is challenging, since it is difficult to determine whether the predictions are consistent with one another. As a solution, we introduce a benchmark dataset, COCOCON, in which contrast sets are created by modifying test instances for multiple tasks in small but semantically meaningful ways that change the gold label, and we outline metrics for measuring whether a model is consistent based on how it ranks the original and perturbed instances across tasks. We find that state-of-the-art systems suffer from a surprisingly high degree of inconsistent behavior across tasks, especially for more heterogeneous tasks. Finally, we propose a rank-correlation-based auxiliary objective, computed over large automatically created cross-task contrast sets, to improve the multi-task consistency of large unified models while retaining their original accuracy on downstream tasks. Project website available at https://adymaharana.github.io/cococon/
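The ranking-based consistency check described above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the function name is hypothetical, and we assume each task produces a scalar score (e.g., a likelihood) for both the original and the perturbed instance, with consistency meaning all tasks agree on which version they prefer.

```python
def is_cross_task_consistent(orig_scores, pert_scores):
    """Illustrative sketch (not the COCOCON implementation).

    A model behaves consistently on an instance when every task ranks
    the original above the perturbed version, or every task ranks the
    perturbed version above the original; disagreement across tasks
    signals cross-task inconsistency.
    """
    prefers_orig = [o > p for o, p in zip(orig_scores, pert_scores)]
    return all(prefers_orig) or not any(prefers_orig)

# Hypothetical scores from three tasks (e.g., captioning, VQA, localization)
print(is_cross_task_consistent([0.9, 0.8, 0.7], [0.5, 0.4, 0.3]))  # True
print(is_cross_task_consistent([0.9, 0.3, 0.7], [0.5, 0.4, 0.3]))  # False
```

In the same spirit, the auxiliary training objective mentioned above would encourage the per-task rankings of original versus perturbed instances to correlate, rather than enforcing agreement as a hard constraint.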