A special-purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation, such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks, including classification, localization, question answering, and captioning. We evaluate the system's ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting.