We seek to semantically describe a set of images, capturing both the attributes of single images and the variations within the set. Our procedure is analogous to Principal Component Analysis, in which the role of projection vectors is replaced by generated phrases. First, a centroid phrase is generated that has the largest average semantic similarity to the images in the set, where both the computation of the similarity and the generation are based on pretrained vision-language models. Then, the phrase that produces the highest variation among the similarity scores is generated, using the same models. The next phrase maximizes the variance subject to being orthogonal, in the latent space, to the highest-variance phrase, and the process continues. Our experiments show that our method convincingly captures the essence of image sets and describes the individual elements in a semantically meaningful way within the context of the entire set. Our code is available at: https://github.com/OdedH/textual-pca.
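To make the analogy to PCA concrete, the following is a minimal sketch of the procedure described above. It assumes CLIP (here the "openai/clip-vit-base-patch32" checkpoint, an illustrative choice) as the pretrained vision-language model, and it simplifies the method by *selecting* phrases from a hypothetical candidate pool rather than generating them with a language model; orthogonality is enforced by a Gram-Schmidt projection in CLIP's joint embedding space. This is a sketch of the idea under those assumptions, not the paper's actual implementation.

```python
# Simplified textual-PCA sketch: select a centroid phrase and
# variance-maximizing "principal phrases" from a candidate pool,
# using CLIP similarities in place of PCA projections.
# NOTE: the real method *generates* phrases; candidate selection
# here is an assumption made to keep the sketch self-contained.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_texts(phrases):
    inputs = processor(text=phrases, return_tensors="pt", padding=True)
    t = model.get_text_features(**inputs)
    return t / t.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    v = model.get_image_features(**inputs)
    return v / v.norm(dim=-1, keepdim=True)

def textual_pca(image_emb, candidates, cand_emb, k=3):
    # Centroid phrase: largest *average* similarity to the image set.
    sims = image_emb @ cand_emb.T              # (num_images, num_candidates)
    centroid = candidates[sims.mean(dim=0).argmax().item()]

    # Principal phrases: greedily pick the phrase whose similarity
    # scores have the highest variance across the set, then project
    # its direction out of the remaining candidates (a Gram-Schmidt
    # stand-in for the paper's latent-space orthogonality constraint).
    chosen, basis = [], []
    emb = cand_emb.clone()
    for _ in range(k):
        for b in basis:
            emb = emb - (emb @ b)[:, None] * b  # remove chosen direction
        scores = image_emb @ emb.T              # per-image projection scores
        idx = scores.var(dim=0).argmax().item()
        chosen.append(candidates[idx])
        basis.append(emb[idx] / emb[idx].norm())
    return centroid, chosen
```

A usage example under the same assumptions: embed the image set with `embed_images`, embed a pool of candidate phrases with `embed_texts`, and call `textual_pca` to obtain one centroid phrase plus `k` principal phrases; each image can then be described by its scores against those phrases, mirroring PCA coordinates.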