High-resolution images are prevalent in various applications, such as autonomous driving and computer-aided diagnosis. However, training neural networks on such images is computationally challenging and easily leads to out-of-memory errors even on modern GPUs. We propose a simple method, Iterative Patch Selection (IPS), which decouples memory usage from input size and thus enables the processing of arbitrarily large images under tight hardware constraints. IPS achieves this by selecting only the most salient patches, which are then aggregated into a global representation for image recognition. For both patch selection and aggregation, a cross-attention-based transformer is introduced, which exhibits a close connection to Multiple Instance Learning. Our method demonstrates strong performance and wide applicability across different domains, training regimes, and image sizes while using minimal accelerator memory. For example, we are able to fine-tune our model on whole-slide images consisting of up to 250k patches (>16 gigapixels) with only 5 GB of GPU memory at a batch size of 16.
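The abstract describes a two-stage procedure: iterate over patches without gradients to keep only the top-M most salient ones, then aggregate the selected patches with gradients through a cross-attention transformer. The sketch below illustrates that control flow in PyTorch under stated assumptions; the toy linear patch encoder, the reuse of a single `nn.MultiheadAttention` module for both scoring and pooling, and the `num_keep`/`chunk` parameters are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class IterativePatchSelection(nn.Module):
    """Minimal sketch: gradient-free iterative top-M patch selection,
    followed by gradient-enabled cross-attention aggregation."""

    def __init__(self, dim=256, num_keep=32, chunk=128, num_classes=10):
        super().__init__()
        # Toy patch encoder; a real model would use a CNN/ViT backbone here.
        self.encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim))
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learnable query token
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)
        self.num_keep, self.chunk = num_keep, chunk

    def _embed(self, patches):
        # patches: (B, n, C, H, W) -> embeddings (B, n, dim)
        b, n = patches.shape[:2]
        return self.encoder(patches.flatten(0, 1)).view(b, n, -1)

    def _scores(self, emb):
        # Patch saliency = cross-attention weight of the query on each patch.
        q = self.query.expand(emb.size(0), -1, -1)
        _, w = self.attn(q, emb, emb, need_weights=True)
        return w.squeeze(1)  # (B, n)

    def forward(self, patches):
        b, n = patches.shape[:2]
        batch_idx = torch.arange(b, device=patches.device).unsqueeze(1)
        # Stage 1: scan all patches chunk by chunk WITHOUT gradients, keeping
        # only a running buffer of the top-M patch indices, so peak memory
        # depends on `chunk` + `num_keep`, not on the total patch count n.
        # (The paper's IPS caches embeddings of kept patches; this sketch
        # re-embeds them each iteration for simplicity.)
        with torch.no_grad():
            keep = torch.empty(b, 0, dtype=torch.long, device=patches.device)
            for i in range(0, n, self.chunk):
                new = torch.arange(i, min(i + self.chunk, n), device=patches.device)
                cand = torch.cat([keep, new.unsqueeze(0).expand(b, -1)], dim=1)
                emb = self._embed(patches[batch_idx, cand])
                top = self._scores(emb).topk(min(self.num_keep, cand.size(1)), dim=1).indices
                keep = torch.gather(cand, 1, top)
        # Stage 2: re-embed only the M selected patches WITH gradients and
        # aggregate them into one global representation via cross-attention.
        emb = self._embed(patches[batch_idx, keep])
        pooled, _ = self.attn(self.query.expand(b, -1, -1), emb, emb)
        return self.head(pooled.squeeze(1))

model = IterativePatchSelection()
x = torch.randn(2, 1000, 3, 16, 16)  # 2 images, 1,000 patches each
logits = model(x)                    # peak memory bounded by chunk size, not 1,000
```

Because stage 1 runs under `torch.no_grad()`, activations for the unselected patches are never stored, which is what decouples memory from the number of patches; only the M selected patches contribute to the backward pass.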