Recent research has shown that it is possible to perform zero-shot classification by training a classifier on synthetic data generated by a diffusion model. However, the performance of this approach is still inferior to that of recent vision-language models. It has been suggested that the cause is a domain gap between the synthetic and real data. In our work, we show that this domain gap is not the main issue, and that diversity in the synthetic dataset is more important. We propose a $\textit{bag of tricks}$ to improve diversity and are able to achieve performance on par with one such vision-language model, CLIP. More importantly, this insight allows us to endow any classification model with zero-shot classification capabilities.