It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply verbalizing the training domain (e.g. "photos of birds") as well as domains we want to extend to but do not have data for (e.g. "paintings of birds") can improve robustness. Using a multimodal model with a joint image and language embedding space, our method LADS learns a transformation of the image embeddings from the training domain to each unseen test domain, while preserving task-relevant information. Without using any images from the unseen test domain, we show that over the extended domain containing both training and unseen test domains, LADS outperforms standard fine-tuning and ensemble approaches over a suite of four benchmarks targeting domain adaptation and dataset bias.
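To make the mechanism concrete, below is a minimal sketch, assuming a CLIP-style joint image-and-text embedding space, of how such a language-guided transformation of image embeddings could be trained. It is not the authors' implementation: the `DomainAugmenter` module, the `lads_style_losses` function, the particular alignment and consistency losses, and all dimensions and hyperparameters are illustrative assumptions; only the high-level idea (shift training-domain image embeddings toward the verbalized unseen domain while preserving task-relevant predictions) comes from the description above.

```python
# Illustrative sketch only; not the LADS reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512  # assumed embedding dimension of a CLIP-like encoder


class DomainAugmenter(nn.Module):
    """Maps training-domain image embeddings toward an unseen target domain."""

    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Keep outputs on the unit sphere, matching normalized CLIP embeddings.
        return F.normalize(self.net(z), dim=-1)


def lads_style_losses(z_img, z_src_text, z_tgt_text, classifier, augmenter):
    """
    z_img:      (B, D) normalized image embeddings from the training domain
    z_src_text: (D,)   text embedding of the training domain, e.g. "a photo of a bird"
    z_tgt_text: (D,)   text embedding of the unseen domain, e.g. "a painting of a bird"
    classifier: linear head trained on source embeddings (kept frozen here)
    """
    z_aug = augmenter(z_img)

    # Domain alignment (assumed form): the augmented embedding should be closer
    # to the target-domain text embedding than to the source-domain one.
    sims = torch.stack([z_aug @ z_tgt_text, z_aug @ z_src_text], dim=-1)
    domain_targets = torch.zeros(len(z_aug), dtype=torch.long, device=z_aug.device)
    domain_loss = F.cross_entropy(sims, domain_targets)

    # Class consistency (assumed form): the augmented embedding should keep the
    # class predicted for the original embedding, preserving task-relevant info.
    with torch.no_grad():
        targets = classifier(z_img).argmax(dim=-1)
    class_loss = F.cross_entropy(classifier(z_aug), targets)

    return domain_loss + class_loss


if __name__ == "__main__":
    augmenter = DomainAugmenter()
    head = nn.Linear(EMB_DIM, 10)  # stand-in for a classifier trained on the source domain
    for p in head.parameters():
        p.requires_grad_(False)

    # Random stand-ins for CLIP image/text embeddings.
    z_img = F.normalize(torch.randn(8, EMB_DIM), dim=-1)
    z_src = F.normalize(torch.randn(EMB_DIM), dim=-1)
    z_tgt = F.normalize(torch.randn(EMB_DIM), dim=-1)

    loss = lads_style_losses(z_img, z_src, z_tgt, head, augmenter)
    loss.backward()
    print(float(loss))
```

In this sketch the augmented embeddings would then be used as additional training data for the classifier head, so that it covers the extended domain without ever seeing target-domain images; how exactly the augmented features are consumed downstream is likewise an assumption here.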