多重场景:单一空气图像中多层识别的大型数据集和基准 (MultiScene: A Large-scale Dataset and Benchmark for Multi-scene Recognition in Single Aerial Images)

Aerial scene recognition is a fundamental research problem in interpreting high-resolution aerial imagery. Over the past few years, most studies focus on classifying an image into one scene category, while in real-world scenarios, it is more often that a single image contains multiple scenes. Therefore, in this paper, we investigate a more practical yet underexplored task -- multi-scene recognition in single images. To this end, we create a large-scale dataset, called MultiScene, composed of 100,000 unconstrained high-resolution aerial images. Considering that manually labeling such images is extremely arduous, we resort to low-cost annotations from crowdsourcing platforms, e.g., OpenStreetMap (OSM). However, OSM data might suffer from incompleteness and incorrectness, which introduce noise into image labels. To address this issue, we visually inspect 14,000 images and correct their scene labels, yielding a subset of cleanly-annotated images, named MultiScene-Clean. With it, we can develop and evaluate deep networks for multi-scene recognition using clean data. Moreover, we provide crowdsourced annotations of all images for the purpose of studying network learning with noisy labels. We conduct experiments with extensive baseline models on both MultiScene-Clean and MultiScene to offer benchmarks for multi-scene recognition in single images and learning from noisy labels for this task, respectively. To facilitate progress, we make our dataset and trained models available on https://gitlab.lrz.de/ai4eo/reasoning/multiscene.

翻译：在翻译高分辨率航空图像时, 星景识别是一个根本性的研究问题。在过去几年里, 大多数研究的重点是将图像分类为一个场景类别, 而在现实世界的情景中, 更常见的是单个图像包含多个场景。因此, 在本文中, 我们调查一项更实用但探索不足的任务 -- -- 多层图像识别。为此, 我们创建了一个大型数据集, 叫做多层Scene, 由10万个未受限制的高分辨率空中图像组成。考虑到手动标注这些图像极为艰苦, 我们使用众包平台, 例如 OpenStretMap (OSM) 低成本的注释。然而, OSM 数据可能因不完全和不正确而受到影响, 而在图像标签中引入噪音。为了解决这个问题, 我们用视觉检查14000个图像并纠正其场景标签, 产生一组干净的附加说明的图像, 名为多层Sceen- Clean。有了这样的人工标签, 我们可以开发和评估多层网络, 使用干净的数据模型, 并且用经过培训的多层图像模型来学习。

相关内容