This paper explores a hierarchical prompting mechanism for the hierarchical image classification (HIC) task. Different from prior HIC methods, our hierarchical prompting is the first to explicitly inject ancestor-class information as a tokenized hint that benefits descendant-class discrimination. We believe this closely imitates human visual recognition: humans may use the ancestor class as a prompt to draw focus on the subtle differences among descendant classes. We implement this prompting mechanism in a Transformer with Hierarchical Prompting (TransHP). TransHP consists of three steps: 1) learning a set of prompt tokens to represent the coarse (ancestor) classes, 2) predicting the coarse class of the input image on the fly at an intermediate block, and 3) injecting the prompt token of the predicted coarse class into the intermediate feature. Though the parameters of TransHP remain the same for all input images, the injected coarse-class prompt conditions (modifies) the subsequent feature extraction and encourages a dynamic focus on the relatively subtle differences among descendant classes. Extensive experiments show that TransHP improves image classification in accuracy (e.g., +2.83% ImageNet classification accuracy over ViT-B/16), training data efficiency (e.g., +12.69% improvement under 10% ImageNet training data), and model explainability. Moreover, TransHP performs favorably against prior HIC methods, showing that it effectively exploits the hierarchical information.
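The three-step mechanism above can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's implementation: the dimensions, the random initialization, and the linear coarse-class head over the pooled feature are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_coarse = 5   # number of ancestor (coarse) classes -- assumed
embed_dim = 8    # token embedding dimension -- assumed
seq_len = 4      # number of patch tokens at the intermediate block -- assumed

# Step 1: one learnable prompt token per coarse class
# (randomly initialized here; learned jointly with the network in TransHP)
prompt_tokens = rng.normal(size=(num_coarse, embed_dim))

# Stand-in for the intermediate feature produced by an earlier Transformer block
features = rng.normal(size=(seq_len, embed_dim))

# Step 2: predict the coarse class on the fly; a linear head over the
# mean-pooled feature is an illustrative stand-in for the trained classifier
coarse_head = rng.normal(size=(embed_dim, num_coarse))
coarse_logits = features.mean(axis=0) @ coarse_head
coarse_pred = int(np.argmax(coarse_logits))

# Step 3: inject the predicted coarse class's prompt token by prepending it,
# so the subsequent blocks attend to it alongside the patch tokens
prompted = np.vstack([prompt_tokens[coarse_pred], features])
# prompted now has one extra token conditioning the rest of the forward pass
```

Note that the model's parameters (including all prompt tokens) are shared across images; only the *selection* of which prompt token to inject depends on the on-the-fly coarse prediction, which is what makes the subsequent feature extraction input-conditional.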