It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply verbalizing the training domain (e.g., "photos of birds") as well as domains we want to extend to but do not have data for (e.g., "paintings of birds") can improve robustness. Using a multimodal model with a joint image and language embedding space, our method LADS learns a transformation of the image embeddings from the training domain to each unseen test domain, while preserving task-relevant information. Without using any images from the unseen test domain, we show that over the extended domain containing both training and unseen test domains, LADS outperforms standard fine-tuning and ensemble approaches over a suite of four benchmarks targeting domain adaptation and dataset bias.
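To make the idea concrete, the sketch below illustrates how a transformation of image embeddings could be learned purely from verbalized domains in a joint image-language space. It assumes OpenAI's CLIP as the multimodal backbone; the prompts, the MLP augmentation network, and the two loss terms are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch: learn a transformation of training-domain image embeddings
# toward an unseen domain described only in language, while keeping the
# class prediction intact. Assumes CLIP; details are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Verbalize the training domain and the unseen domain we want to extend to.
prompts = ["a photo of a bird", "a painting of a bird"]
with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_emb = F.normalize(model.encode_text(tokens).float(), dim=-1)
train_txt, target_txt = text_emb[0], text_emb[1]

# Learned transformation applied entirely in the shared embedding space.
embed_dim = text_emb.shape[-1]
augment = nn.Sequential(
    nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
).to(device)

def lads_style_losses(img_emb, class_text_emb, labels):
    """img_emb: normalized CLIP embeddings of training-domain images.
    class_text_emb: normalized text embeddings of the class names.
    """
    aug = F.normalize(augment(img_emb), dim=-1)
    # Domain alignment (hypothetical formulation): the augmented embedding
    # should move along the train->target direction given by the text prompts.
    direction = F.normalize(target_txt - train_txt, dim=-1)
    domain_loss = 1 - F.cosine_similarity(aug - img_emb, direction.expand_as(aug)).mean()
    # Class consistency: augmenting the domain must not change the label.
    logits = 100.0 * aug @ class_text_emb.t()
    class_loss = F.cross_entropy(logits, labels)
    return domain_loss + class_loss
```

At test time, a classifier fine-tuned on both the original and the augmented embeddings can then be evaluated over the extended domain without ever seeing a target-domain image.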