Despite their recent competitive performance across a range of vision tasks, vision Transformers still suffer from heavy computational costs. Recently, visual prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. However, the efficiency of existing models is still far from satisfactory due to the insertion of extensive prompt blocks and tricky prompt designs. In this paper, we propose an efficient vision model named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models whose memory costs remain stable across various complex tasks. In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained main backbone, with the parameters in the backbone frozen. Moreover, we prune the parameters in these two layers according to the lottery ticket hypothesis. The performance obtained by our LION is promising on a wide range of datasets. In particular, our LION reduces the number of training parameters by up to 11.5% while achieving higher performance than the state-of-the-art baseline VPT, especially under challenging scenes. Furthermore, we find that our proposed LION has good generalization performance, making it an easy way to boost transfer learning in the future.
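To make the notion of an equilibrium implicit layer concrete, the following is a minimal sketch (not the paper's actual layer): the layer's output is the fixed point z* of an update z = tanh(Wz + Ux + b), found by simple iteration. The function name `equilibrium_layer`, the tanh parameterization, and the contraction scaling of W are illustrative assumptions; in LION such a layer would sit at each end of a frozen backbone, with only the layer's own parameters trained.

```python
import numpy as np

def equilibrium_layer(x, W, U, b, tol=1e-6, max_iter=200):
    """Illustrative implicit layer: solve z* = tanh(W @ z* + U @ x + b)
    by fixed-point iteration. W, U, b would be the small set of trainable
    parameters; the backbone between two such layers stays frozen."""
    z = np.zeros_like(b)
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(0)
d = 8
# Scale W down so the map is a contraction and the iteration converges.
W = 0.1 * rng.standard_normal((d, d))
U = rng.standard_normal((d, d))
b = rng.standard_normal(d)
x = rng.standard_normal(d)

z_star = equilibrium_layer(x, W, U, b)
# z_star satisfies the fixed-point equation up to the tolerance.
residual = np.linalg.norm(z_star - np.tanh(W @ z_star + U @ x + b))
```

Because the output is defined implicitly as a fixed point rather than by stacking more layers, the memory cost of such a layer stays constant regardless of how many solver iterations are run, which is the property that motivates LION.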