Vision transformers are nowadays the de facto choice for image classification tasks. Classification tasks fall into two broad categories: fine-grained and coarse-grained. Fine-grained classification requires discovering subtle differences, owing to the high degree of similarity between sub-classes. Such distinctions are often lost when images are downscaled to save the memory and computational cost associated with vision transformers (ViT). In this work, we present an in-depth analysis and describe the critical components for developing a system for the fine-grained categorization of plants from herbarium sheets. Our extensive experimental analysis indicates the need for better augmentation techniques and for modern neural networks that can handle higher-dimensional images. We introduce a convolutional transformer architecture called Conviformer which, unlike the popular Vision Transformer (ViT), can handle higher-resolution images without exploding memory and computational cost. We also introduce a novel, improved pre-processing technique called PreSizer that resizes images while preserving their original aspect ratios, which proved essential for classifying natural plants. With our simple yet effective approach, we achieve SoTA on the Herbarium 202x and iNaturalist 2019 datasets.