A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also exhibit puzzling failures. Probing these models' abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks involving uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot uncertainty). We devise 10 types of tasks over 40 datasets in order to evaluate different aspects of reliability in both the vision and language domains. To improve reliability, we develop ViT-Plex and T5-Plex, pretrained large model extensions for the vision and language modalities, respectively. Plex greatly improves the state-of-the-art across reliability tasks and simplifies the traditional protocol, as it improves out-of-the-box performance and does not require designing scores or tuning the model for each task. We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples. We also demonstrate Plex's capabilities on challenging tasks including zero-shot open set recognition, active learning, and uncertainty in conversational language understanding.
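To make two of the reliability tasks named above concrete, the sketch below illustrates selective prediction (abstaining on the least confident examples) and log-likelihood as a proper scoring rule. This is a minimal, hypothetical example for intuition only, not the paper's evaluation code; the function names and the toy data are assumptions.

```python
# Minimal sketch (not the paper's implementation) of two reliability metrics:
# a proper scoring rule (negative log-likelihood) and selective prediction accuracy.
import numpy as np

def negative_log_likelihood(probs, labels):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def selective_accuracy(probs, labels, coverage=0.8):
    """Accuracy on the `coverage` fraction of examples the model is most confident about."""
    confidence = probs.max(axis=-1)
    n_keep = int(np.ceil(coverage * len(labels)))
    keep = np.argsort(-confidence)[:n_keep]  # keep most confident; abstain on the rest
    return np.mean(probs[keep].argmax(axis=-1) == labels[keep])

# Toy usage: 4 examples, 3 classes (hypothetical predictions and labels).
probs = np.array([[0.80, 0.10, 0.10],
                  [0.30, 0.40, 0.30],
                  [0.10, 0.70, 0.20],
                  [0.25, 0.25, 0.50]])
labels = np.array([0, 2, 1, 2])
print(negative_log_likelihood(probs, labels))    # proper scoring rule over all examples
print(selective_accuracy(probs, labels, 0.75))   # accuracy on the 75% most confident
```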