Image resizing is a fundamental preprocessing operation in modern computer vision. Throughout the deep learning revolution, researchers have largely overlooked the potential of resizing methods beyond the commonly used, readily available resizers such as nearest-neighbor, bilinear, and bicubic. The key question of interest to us is whether the front-end resizer affects the performance of deep vision models. In this paper, we present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed the MULLER resizer. MULLER has a bandpass nature: it learns to boost details in certain frequency subbands that benefit the downstream recognition models. We show that MULLER can be easily plugged into various training pipelines, and that it effectively boosts the performance of the underlying vision task at little to no extra cost. Specifically, we select a state-of-the-art vision Transformer, MaxViT, as the baseline and show that, when trained with MULLER, MaxViT gains up to 0.6% top-1 accuracy on ImageNet-1k, and saves 36% of inference cost at comparable top-1 accuracy, relative to the standard training scheme. Notably, MULLER's performance also scales with model size and training data size (e.g., ImageNet-21k and JFT), and it is widely applicable to multiple vision tasks, including image classification, object detection and segmentation, and image quality assessment.
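To make the mechanism concrete, below is a minimal NumPy/SciPy sketch of a multilayer Laplacian resizer in the spirit of MULLER. It assumes a simple formulation in which each layer extracts a detail band as the difference between the image and a Gaussian-blurred copy, then adds the band back scaled by two learnable scalars per layer; the helper names (`resize_bilinear`, `muller_resize`), the Gaussian band construction, and all hyperparameter values are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a multilayer Laplacian resizer in the spirit of MULLER.
# Assumptions (not from the paper): each layer's bandpass detail is computed
# as (image - Gaussian-blurred image), and each layer has two trainable
# scalars: a gain (alpha) and a bias (beta).
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def resize_bilinear(img, out_hw):
    """Bilinear base resize (order=1) of an HxW[xC] image."""
    h, w = img.shape[:2]
    factors = (out_hw[0] / h, out_hw[1] / w) + (1,) * (img.ndim - 2)
    return zoom(img, factors, order=1)

def muller_resize(img, out_hw, alphas, betas, sigmas):
    """Multilayer Laplacian resize: base resize plus learned detail bands.

    alphas/betas: per-layer scalar gain and bias (the only trainable params).
    sigmas: per-layer Gaussian widths defining each frequency subband.
    """
    x = resize_bilinear(img, out_hw)
    for a, b, s in zip(alphas, betas, sigmas):
        # Blur spatial axes only; sigma=0 on the channel axis means no mixing.
        blurred = gaussian_filter(x, sigma=(s, s) + (0,) * (x.ndim - 2))
        band = x - blurred            # bandpass detail at this scale
        x = x + a * band + b          # boost (or suppress) the subband
    return x

# Usage: a 3-layer resizer carries just 6 trainable parameters.
img = np.random.rand(224, 224, 3).astype(np.float32)
out = muller_resize(img, (160, 160),
                    alphas=[0.5, 0.3, 0.1],
                    betas=[0.0, 0.0, 0.0],
                    sigmas=[1.0, 2.0, 4.0])
print(out.shape)  # (160, 160, 3)
```

Under these assumptions, the resizer's cost is essentially that of the base resize plus a few blurs, which is what makes it nearly free to plug in front of an existing recognition model and to train jointly with it.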