Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative, recent methods rely on limited supervision to disentangle the factors of variation and ensure their identifiability. Although the true generative factors need to be annotated for only a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To this end, we propose a method for disentangling a set of factors that are only partially labeled, while separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, motivates leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains (e.g., human faces) with minimal manual effort. Specifically, we use a recent language-image embedding model (CLIP) to annotate a set of attributes of interest in a zero-shot manner and demonstrate state-of-the-art disentangled image manipulation results.
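The abstract describes zero-shot attribute annotation with CLIP only at a high level. Below is a minimal sketch of that idea, assuming the HuggingFace `transformers` CLIP API and the `openai/clip-vit-base-patch32` checkpoint; the prompt templates and the 0.5 decision threshold are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: zero-shot attribute annotation with CLIP by contrasting a
# positive and a negative text prompt. Prompts, checkpoint, and the
# threshold below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def annotate_attribute(image: Image.Image, attribute: str) -> float:
    """Return the probability that `attribute` is present in `image`,
    estimated from CLIP's image-text similarity scores."""
    prompts = [f"a photo of a person with {attribute}",
               f"a photo of a person without {attribute}"]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, 2)
    # Probability mass assigned to the positive prompt.
    return logits.softmax(dim=-1)[0, 0].item()

# Usage: partially annotate one attribute for one face image.
# score = annotate_attribute(Image.open("face.jpg"), "a smile")
# label = score > 0.5  # hypothetical threshold for a binary pseudo-label
```

Scoring each image against a positive/negative prompt pair yields binary pseudo-labels for a chosen subset of attributes without manual annotation; under the paper's setting, the remaining (residual) factors stay unspecified.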