We study the problem of generating a training-free, task-dependent visual classifier from text descriptions without visual samples. This \textit{Text-to-Model} (T2M) problem is closely related to zero-shot learning, but unlike previous work, a T2M model infers a model tailored to a task, taking into account all classes in the task. We analyze the symmetries of T2M and characterize the equivariance and invariance properties of corresponding models. In light of these properties, we design an architecture based on hypernetworks that, given a set of new class descriptions, predicts the weights for an object recognition model that classifies images from those zero-shot classes. We demonstrate the benefits of our approach compared to zero-shot learning from text descriptions in image and point-cloud classification using various types of text descriptions, from single words to rich text descriptions.
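As a concrete illustration of the hypernetwork idea (a minimal sketch, not the paper's implementation), the snippet below maps a set of class text embeddings to per-class classifier weights with a permutation-equivariant, DeepSets-style set encoder, so that each predicted weight vector depends on all classes in the task; the dimensions, module names, and mean-pooling choice are illustrative assumptions.

\begin{verbatim}
import torch
import torch.nn as nn

class HyperClassifier(nn.Module):
    """Illustrative hypernetwork: maps a set of K class text embeddings
    to the weights of a linear image classifier (one weight vector per
    class). The pooled task context makes each predicted weight vector
    depend on all classes while keeping the map permutation-equivariant."""

    def __init__(self, text_dim=512, img_dim=512, hidden=256):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, img_dim))

    def forward(self, class_text_emb):             # (K, text_dim)
        h = self.phi(class_text_emb)               # per-class encoding
        ctx = h.mean(dim=0, keepdim=True)          # task context (permutation-invariant)
        w = self.rho(torch.cat([h, ctx.expand_as(h)], dim=-1))
        return w                                   # (K, img_dim) classifier weights

    def classify(self, img_feat, class_text_emb):  # img_feat: (B, img_dim)
        w = self.forward(class_text_emb)
        return img_feat @ w.t()                    # (B, K) logits
\end{verbatim}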