In the field of intelligent multimedia analysis, ultra-fine-grained visual categorization (Ultra-FGVC) plays a vital role in distinguishing intricate subcategories within broader categories. However, this task is inherently challenging due to the complex granularity of category subdivisions and the limited availability of data for each category. To address these challenges, this work proposes CSDNet, a pioneering framework that effectively explores contrastive learning and self-distillation to learn discriminative representations specifically designed for Ultra-FGVC tasks. CSDNet comprises three main modules: Subcategory-Specific Discrepancy Parsing (SSDP), Dynamic Discrepancy Learning (DDL), and Subcategory-Specific Discrepancy Transfer (SSDT), which collectively enhance the generalization of deep models across instance, feature, and logit prediction levels. To increase the diversity of training samples, the SSDP module introduces augmented samples from different viewpoints to spotlight subcategory-specific discrepancies. Simultaneously, the proposed DDL module stores historical intermediate features by a dynamic memory queue, which optimizes the feature learning space through iterative contrastive learning. Furthermore, the SSDT module is developed by a novel self-distillation paradigm at the logit prediction level of raw and augmented samples, which effectively distills more subcategory-specific discrepancies knowledge from the inherent structure of limited training data without requiring additional annotations. Experimental results demonstrate that CSDNet outperforms current state-of-the-art Ultra-FGVC methods, emphasizing its powerful efficacy and adaptability in addressing Ultra-FGVC tasks.
翻译:暂无翻译