Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). ELEVATER is a platform for Computer Vision in the Wild (CVinW), and is publicly released at https://computer-vision-in-the-wild.github.io/ELEVATER/
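To make the sample-efficiency metric concrete, the following is a minimal sketch of the standard zero-shot classification protocol such metrics cover: each image embedding is matched against embeddings of class-name prompts, and the nearest prompt gives the prediction. The features here are random stand-ins; a real run would obtain them from a pre-trained language-augmented model (e.g., a CLIP-style image/text encoder pair), and the function name is our own, not the toolkit's API.

```python
# Minimal sketch of zero-shot classification accuracy, assuming
# L2-normalized image and class-prompt embeddings from a CLIP-style
# model. Random tensors stand in for real encoder outputs below.
import torch
import torch.nn.functional as F

def zero_shot_accuracy(image_feats, text_feats, labels):
    """image_feats: (N, D) image embeddings (L2-normalized)
    text_feats:  (C, D) class-name prompt embeddings (L2-normalized)
    labels:      (N,)   ground-truth class indices
    """
    logits = image_feats @ text_feats.T   # cosine similarities
    preds = logits.argmax(dim=-1)         # nearest class prompt wins
    return (preds == labels).float().mean().item()

# Toy usage: random features in place of real encoder outputs.
N, C, D = 8, 3, 16
img = F.normalize(torch.randn(N, D), dim=-1)
txt = F.normalize(torch.randn(C, D), dim=-1)
y = torch.randint(0, C, (N,))
print(zero_shot_accuracy(img, txt, y))
```

Few-shot and linear-probing evaluations follow the same pattern but additionally fit a lightweight classifier on a small number of labeled examples, while full fine-tuning updates all model weights.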