Can we leverage high-resolution information without the unsustainable quadratic complexity in input scale? We propose the Traversal Network (TNet), a novel multi-scale hard-attention architecture that traverses image scale-space in a top-down fashion, visiting only the most informative image regions along the way. TNet offers an adjustable trade-off between accuracy and complexity by changing the number of attended image locations. We compare our model against hard-attention baselines on ImageNet, achieving higher accuracy with fewer resources (FLOPs, processing time, and memory). We further test our model on the fMoW dataset, where we process satellite images of size up to $896 \times 896$ px, obtaining up to $2.5\times$ faster processing than baselines operating at the same resolution, while also achieving higher accuracy. TNet is modular: most classification models can be adopted as its backbone for feature extraction, making the reported performance gains orthogonal to the benefits offered by existing optimized deep models. Finally, hard attention guarantees a degree of interpretability for our model's predictions, at no extra cost beyond inference. Code is available at $\href{https://github.com/Tpap/TNet}{github.com/Tpap/TNet}$.