Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that crops containing objects of different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the semantic-similarity bias caused by objects being split across image crops, we propose a multi-resolution semantic fusion method that integrates semantic similarity maps obtained at different resolutions, producing more accurate semantic information while preserving the integrity of target objects. Furthermore, to localize target objects directly at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks with different MLLMs demonstrate the effectiveness of our approach.
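To make the multi-resolution semantic fusion idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes a CLIP-style crop–query scorer (here a hypothetical `query_score` callable), scores non-overlapping crops at two crop sizes, resizes the resulting similarity maps to a common grid, and fuses them element-wise. The max-fusion rule and all function names are illustrative assumptions.

```python
import numpy as np

def crop_similarity_map(image: np.ndarray, query_score, crop: int) -> np.ndarray:
    """Split the image into non-overlapping crop x crop tiles and score each
    tile against the query. `query_score` is a hypothetical stand-in for a
    pretrained crop-query scorer (e.g. a CLIP-style retrieval model)."""
    h, w = image.shape[:2]
    rows, cols = h // crop, w // crop
    sim = np.zeros((rows, cols), dtype=np.float32)
    for r in range(rows):
        for c in range(cols):
            tile = image[r * crop:(r + 1) * crop, c * crop:(c + 1) * crop]
            sim[r, c] = query_score(tile)
    return sim

def fuse_multires(sim_maps, out_shape):
    """Resize each similarity map to a common grid (nearest-neighbour) and
    fuse element-wise. Taking the maximum keeps an object's response from
    whichever resolution handled it best; the max rule is an assumption,
    not necessarily the paper's exact fusion operator."""
    fused = np.zeros(out_shape, dtype=np.float32)
    for sim in sim_maps:
        ri = np.arange(out_shape[0]) * sim.shape[0] // out_shape[0]
        ci = np.arange(out_shape[1]) * sim.shape[1] // out_shape[1]
        fused = np.maximum(fused, sim[np.ix_(ri, ci)])
    return fused

if __name__ == "__main__":
    # Toy usage: score a 1024x1024 image at two crop sizes and fuse.
    rng = np.random.default_rng(0)
    image = rng.random((1024, 1024, 3)).astype(np.float32)
    score = lambda tile: float(tile.mean())  # placeholder scorer
    maps = [crop_similarity_map(image, score, c) for c in (128, 256)]
    fused = fuse_multires(maps, out_shape=(8, 8))
    print(fused.shape)  # (8, 8) fused semantic similarity map
```

Fusing across crop sizes means a large object that is fragmented at the fine crop size can still contribute a strong, intact response from the coarser map, which is the effect the method relies on.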