Transformers have been successfully applied in many fields and are becoming standard tools in computer vision. However, self-attention, a core component of the transformer, has quadratic complexity in the number of tokens, which limits the use of transformers in vision tasks that require dense prediction. Many methods aiming to solve this problem have been proposed, but no comparative study of these methods at the same scale has been reported, because they differ in model configuration, training scheme, and additional techniques. In this paper, we validate these efficient attention models on the ImageNet1K classification task by changing only the attention operation and examining which efficient attention performs best.
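The quadratic cost mentioned above comes from the full attention matrix over all token pairs. The following is a minimal sketch (not from the paper, using standard scaled dot-product self-attention with hypothetical shapes) that shows where the O(N^2) term in sequence length N arises:

```python
# Minimal sketch: standard scaled dot-product self-attention,
# illustrating where the quadratic cost in sequence length N comes from.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (N, d) token embeddings; w_q/w_k/w_v: (d, d) projections (hypothetical shapes)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # each (N, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (N, N): O(N^2) memory and compute
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # (N, d)

# Example: 14x14 = 196 patch tokens with dimension 64.
N, d = 196, 64
x = np.random.randn(N, d)
w_q, w_k, w_v = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (196, 64)
```

For dense-prediction settings with much larger token counts (e.g., high-resolution feature maps), the (N, N) attention matrix is exactly what the efficient attention variants compared in this paper try to avoid materializing.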